6c. ADDRESS (City, State, and ZIP Code)

Fort  Collins,  CO 

8a.  NAME  OF  FUNDING /SPONSORING 

8b.  OFFICE  SYMBOL 

ORGANIZATION

(If applicable)

AFOSR/NM 

8c. ADDRESS (City, State, and ZIP Code)

AFOSR/NM

Bolling AFB, DC 20332-6448

7b. ADDRESS (City, State, and ZIP Code)

AFOSR/NM

Bolling AFB, DC 20332-6448


AFOSR-86-0070 


10.  SOURCE  OF  FUNDING  NUMBERS 


Computer Science and Statistics: 18th Symposium on the Interface


12.  PERSONAL  AUTHOR(S) 

Thomas J. Boardman


13a. TYPE OF REPORT
Final 


16.  SUPPLEMENTARY  NOTATION 


14. DATE OF REPORT (Year, Month, Day)  15. PAGE COUNT

87/08/26


COSATI  CODES 


GROUP  SUB-GROUP 


18  SUBJECT  TERMS  (Continue  on  reverse  if  necessary  and  identify  by  block  number) 


19.  ABSTRACT  (Continue  on  reverse  If  necessary  and  identify  by  block  number) 

This  report  contains  the  proceedings  of  a  conference  on  the  interface  between  computer 
science  and  statistics.  The  conference  "Computer  Science  and  Statistics:  18th  Symposium 
on  the  Interface"  was  held  19-21  March  1986  at  Fort  Collins,  CO. 


20.  DISTRIBUTION  /AVAILABILITY  OF  ABSTRACT 

UNCLASSIFIED/UNLIMITED  □  SAME AS RPT


22a.  NAME  OF  RESPONSIBLE  INDIVIDUAL 

Brian Woodruff, Maj, USAF


21  ABSTRACT  SECURITY  CLASSIFICATION 

Unclassified 


22b. TELEPHONE (Include Area Code)  22c. OFFICE SYMBOL

AFOSR/NM


COMPUTER  SCIENCE 
AND  STATISTICS 

PROCEEDINGS  OF  THE  18th  SYMPOSIUM 
ON  THE  INTERFACE 


AFOAiK-  fK-  0  ^  ...  o  1 5  3 


Fort  Collins,  Colorado,  March  1986 


Editor 

THOMAS  J.  BOARDMAN 

Colorado  State  University 
Fort  Collins,  Colorado,  U.S.A. 


Assistant  Editor 

IRENE  M.  STEFANSKI 

American  Statistical  Association 
Washington, D.C., U.S.A.


ASA 




1986 


AMERICAN  STATISTICAL  ASSOCIATION,  WASHINGTON,  D.C. 




The  papers  and  discussions  in  this  Proceedings  volume  are  reproduced  exactly  as 
received  from  the  authors.  None  of  the  papers  has  been  submitted  to  a  refereeing 
process.  However,  the  authors  have  been  encouraged  to  have  their  papers  reviewed  by 
a  colleague  prior  to  final  preparation.  These  presentations  are  presumed  to  be 
essentially  as  given  at  the  18th  Symposium  on  the  Interface.  This  Proceedings  volume 
is  not  copyrighted  by  the  Association;  hence,  permission  for  reproduction  must  be 
obtained from the author, who holds the rights under the copyright law.

Authors  in  these  Proceedings  are  encouraged  to  submit  their  papers  to  any  journal  of 
their  choice.  The  ASA  Board  of  Directors  has  ruled  that  publication  in  the  Proceedings 
does  not  preclude  publication  elsewhere. 


American  Statistical  Association 
806  15th  Street,  N.W.,  Suite  640 
Washington,  DC  20005 


PRINTED  IN  THE  U.S.A. 


PREFACE 


The  18th  Symposium  on  the  Interface  between  Statistics  and  Computer  Science 
follows  in  a  long-standing  series  designed  to  provide  a  forum  for  numerical  analysts, 
statisticians,  and  computer  scientists  to  meet,  listen,  and  discuss  topics  of  mutual 
interest. 

The  18th  Symposium  was  held  at  the  University  Park  Holiday  Inn,  Fort  Collins, 
Colorado,  on  March  19-21,  1986,  hosted  by  Colorado  State  University. 

The  registrants  reflected  the  international  nature  of  the  Symposium.  Among  the 
300-plus  registrants  were  statisticians,  computer  scientists,  numerical  analysts, 
combinations  of  the  preceding,  and  others  from  most  states  in  the  U.S.,  several  of  the 
provinces  in  Canada,  and  six  other  countries.  The  organizers  were  pleased  with  the 
distribution  and  magnitude  of  the  attendance. 

The  official  cosponsors  of  the  18th  Symposium  were  Colorado  State  University,  the 
Statistical  Computing  Section  of  the  American  Statistical  Association  (ASA),  the 
International  Association  for  Statistical  Computing,  and  the  Colorado-Wyoming 
Chapter  of  ASA.  After  many  years  of  discussion,  the  proceedings  are  being  published 
by  ASA  and  made  available  the  same  year  as  the  Symposium. 

On  March  19,  1986,  two  short  courses  were  conducted  at  the  Holiday  Inn.  Peter 
Lewis  presented  material  on  his  "Advanced  Simulation  and  Statistics  Package"  from  9 
a.m.  to  4  p.m.  At  the  same  time  several  representatives  from  TCI  Software 
demonstrated the "T³ Scientific Word Processing System." Registration began at 6
p.m.  and  continued  during  the  beginning  of  the  Welcoming  Reception/Mixer  that  was 
held  in  the  Fountain  Court  of  the  hotel.  The  keynote  address  by  John  W.  Tukey, 
entitled  "The  Interface  with  Computing:  In  the  Small  or  In  the  Large,"  opened  the 
Symposium  on  Thursday  morning.  Thereafter,  three  invited  and  one  contributed 
sessions  continued  on  Thursday  and  Friday.  The  Holiday  Inn  staff  prepared  delightful 
buffet  luncheons  on  Thursday  and  Friday,  which  provided  the  registrants  with  ample 
time  to  pursue  other  interests  and  carry  on  further  discussion  over  lunch.  During  the 
conference  eleven  firms  took  the  opportunity  to  exhibit  their  software  or  materials  (see 
the  list  on  page  v).  The  exhibit  room  also  served  as  the  coffee  break  location,  thus 
increasing  the  traffic  to  the  exhibit  area.  As  one  might  expect,  much  of  the  true 
interface  of  information  occurred  during  the  informal  discussions  and  demonstrations. 
Our  organizing  committee  made  efforts  to  provide  facilities  and  opportunities  for  these 
activities.  We  encourage  future  Symposium  organizers  to  expand  on  these  opportunities. 


As chairman of the Symposium, I gratefully acknowledge the help provided me by
Jim zumBrunnen, the vice chairman, and Marilyn Lesh, our secretary. Without their
encouragement and support, the success of the Symposium would have been in question.
Then, too, we must reserve special thanks to the program committee who organized the
invited sessions; their efforts to secure speakers and topics made the program
outstanding. The committee consisted of: Daniel B. Carr, Pacific Northwest
Laboratory; Paula Cowley, Pacific Northwest Laboratory; James Dolby (deceased), Los
Altos, California; William Eddy, Carnegie-Mellon University; Dennis Friday, National
Bureau of Standards; Richard Jones, University of Colorado; William Kennedy, Iowa
State University; John Nash, BYTE and University of Ottawa; Wesley Nicholson, Pacific
Northwest Laboratory; Robert Schnabel, University of Colorado; G.W. "Pete" Stewart,
University of Maryland; Paul Switzer, Stanford University; Mike Tarter, University of
California, Berkeley; Bob Teitel, Teitel Data Systems; John Tukey, Princeton
University; and Paul Velleman, Cornell University.

Local arrangements were carried out by Colorado State University's Office of
Conference  Services.  Jill  Lancaster  and  her  staff  did  an  outstanding  job  with  all  of  the 
many  details  in  running  our  Symposium.  The  staff  at  the  Holiday  Inn  under  the  able 
leadership of Jane Folsom made everyone feel welcome. As I have learned from Dr. W.
Edwards  Deming,  when  people  know  what  their  job  is  and  what  their  customers  expect, 
it  is  not  uncommon  to  find  that  quality  service  can  result.  The  hotel  and  Conference 
Service  staff  knew  our  needs  and  exceeded  our  expectations.  We  praise  them  for  their 
efforts. 

As  indicated  on  page  v  of  these  proceedings,  financial  support  for  the  Symposium 
was  made  possible  by  three  organizations.  When  bringing  a  diverse  group  of  people 
together, travel and related expenses must be covered for some of the participants.
This support was essential. In addition, this year for the first time we offered student
fellowships  for  graduate  students  in  statistics  or  computer  science.  I  hope  that  this 
model  will  be  continued  in  the  future. 

I  offer  thanks  to  the  many  other  Colorado  State  University  staff,  students,  and 
faculty  members  who  helped  to  make  this  Symposium  possible.  I  would  like  to  thank 
Randall  Spoeri,  Associate  Executive  Director  of  ASA,  whose  advice  and  counsel  was 
sought often and was always helpful. Finally, on behalf of all of the contributors to this
volume,  I  wish  to  thank  Irene  M.  Stefanski,  ASA  Publications  Manager,  who  served  as 
Assistant  Editor  for  these  proceedings,  and  Debra  B.  Shapiro,  Publications  Assistant. 

Thomas  J.  Boardman, 

Editor


THE INTERFACE WITH COMPUTING: IN THE SMALL OR IN THE LARGE?*

John  W.  Tukey,  Princeton  University,  Princeton,  NJ  08544 


If  “the  interface”  is  to  be  a  real  interface,  it  needs  to 
discuss  problems  that  are  real  on  both  sides.  If  statisticians 
are  to  help  the  practice  of  analyzing  data  to  even  approach 
an economic balance between the decreasing costs of computation
and the advancing salaries of statistical consultants
(and of statistical theorists), they must ask their computer
-- and soon their workstation -- for much larger and more
important  computing  tasks. 

Either of these points leaves statisticians seeking to formulate
problems very much larger than we have been accustomed
to consider -- large enough to be a real strain for
computing-science specialists to consider (and often too hard
for statisticians alone).

We  must  also  strain  the  statisticians!  Problems  that  will 
challenge  the  computer  scientists  will  not  be  found  by  sticking 
to  simplicity,  where  simplicity  is  inappropriate!  I  shall  return, 
shortly, to this challenge to statisticians, but before I do, let
me  illustrate  a  few  larger  problems  of  computation. 

There are a variety of major problem-classes, including
factorial data, and regression, where minimizing the sum of
absolute (values of) residuals, conveniently “minimizing the
L¹-norm,” is a useful intermediate step to what we really want
to do. (Simplicity is already disappearing.) When we are
making a linear fit, the minimum-L¹ fit is rarely unique.
Rather there is a convex subset of parameter space,
bounded by hyperplanes, throughout which the L¹-norm is
constant at its minimum. In moving toward uniqueness, it is
natural to first focus on the vertices (the corner points) of this
convex set, of which there may be many. (Each vertex turns
out to correspond to a fit with many exact-zero residuals.)
It  might  be  helpful  to  get  a  list  of  all  these  vertices;  it  might 
be  even  more  helpful  to  get  a  list  of  those  in  a  subset  which 
also  makes  extreme  some  second  criterion  (maximizing  the 
i'-norm is often handy). Work by Peter Rousseeuw in the
linear regression case, and by Eugene Johnson in the factorial
analysis case, has made important progress, but I suspect
our computing science colleagues can help us further, if only
about how to do the generalizations to non-linear fits.
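
The non-uniqueness of a minimum-L¹ fit can be seen in the simplest possible case. The sketch below is an illustration only, not taken from the talk: it fits a single constant (a location) to four made-up data values, and the names `l1_loss`, `data`, and `inside_losses` are hypothetical. Every value between the two middle observations gives the same sum of absolute residuals, and the two end points of that interval (the "vertices") each have an exact-zero residual.

```python
def l1_loss(m, data):
    """Sum of absolute residuals when the constant m is fitted to data."""
    return sum(abs(x - m) for x in data)

# Made-up data with an even number of points (assumption for illustration).
data = [1.0, 2.0, 4.0, 7.0]

# The two middle order statistics bound the set of L1 minimizers.
ordered = sorted(data)
lower_vertex, upper_vertex = ordered[1], ordered[2]

# Candidate fits inside the interval, including both vertices.
inside = [lower_vertex, 2.5, 3.0, 3.5, upper_vertex]
inside_losses = [l1_loss(m, data) for m in inside]

# Candidate fits outside the interval do strictly worse.
outside_losses = [l1_loss(m, data) for m in (1.5, 5.0)]

print("losses inside the minimizing interval:", inside_losses)
print("losses outside the interval:", outside_losses)
```

In a regression with several coefficients the interval becomes a convex polytope in parameter space, and enumerating its vertices is the harder problem the text describes.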

Configural polysampling (Tukey 1986), in which different
weighting schemes allow a single set of sample configurations
to be honestly taken as representing each of a few -- not
just one -- tastefully selected parent distributional situations,
is today our only effective way to learn about the possibilities
of robustness in real, finite samples. To make things work we
need to evaluate a few more-dimensional integrals for each of
a few parent situations, doing this for each of several hundred
configurations. This can easily mean 10,000 multidimensional
quadratures, of smooth but somewhat nasty integrands!

Location and scale together call for two-dimensional integrals;
fitting straight lines calls for three-dimensional ones;
regressions with 3 or more coefficients call for numerical
integration in even more dimensions. At the moment, by working
hard, one can do a probably satisfactory job in two
dimensions, and dream of doing the same in three dimensions.

It is not satisfactory to “compute to death” such a problem
by adopting integration procedures that evaluate the integrand
at many, many points! In four dimensions, 100,000
evaluations, at 100 or so arithmetic operations per evaluation,
would only be 10 million or so arithmetic operations per integral,
taking a minute on a reasonably available computer.
For one integral this is fine; for 10,000 integrals it’s a week’s
work, and one more dimension will surely ruin us! What to
do?
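
The arithmetic behind "a week's work" can be checked directly. The short sketch below only restates the figures quoted in the paragraph above; the variable names are ours, and the one-minute-per-integral figure is taken from the text rather than measured.

```python
# Figures quoted in the text (names are illustrative).
evals_per_integral = 100_000   # integrand evaluations in four dimensions
ops_per_eval = 100             # arithmetic operations per evaluation
minutes_per_integral = 1.0     # quoted running time for one integral
n_integrals = 10_000           # quadratures needed for configural polysampling

ops_per_integral = evals_per_integral * ops_per_eval
total_minutes = n_integrals * minutes_per_integral
total_days = total_minutes / (60 * 24)

# Machine speed implied by "10 million operations in a minute".
implied_ops_per_second = ops_per_integral / (minutes_per_integral * 60)

print(f"{ops_per_integral:,} operations per integral")
print(f"{total_days:.1f} days of continuous computing for all integrals")
print(f"about {implied_ops_per_second:,.0f} operations per second implied")
```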

If we only want a few millions of arithmetic operations,
we can have them easily. One thing we all ought to want to do
with them is to have our workstations busily chewing over the
last set of data we put in, whenever it is not doing something
we explicitly asked for, to see what it can find. What sort of
program should we write to do this? What sort of “cognostics”
should it look at? How should it schedule its attention
to different aspects (of a single body of data)? What sorts
of reports should it make? How should it assess urgency of
reporting? These are sample questions. Looking at them we
can see the need for combining both real data-analysis experience
and sound complex-computing-system experience, if
these questions are going to be answered effectively and well.

As a final example, think about the facilitation of preparing
and modifying what statisticians are happy to call “expert
systems.” (Some computer scientists think our systems are
too simple for such a title.)

Many of us are going to at least need aid in preparing
and modifying such systems. We all hope the tasks will be
simpler, not harder. Within what framework should our systems
be built and modified for this to be so? We can have
ideas about this today, but it is almost certainly too soon
to freeze one or more standard frameworks -- however it is
not too soon to start thinking about what will be important
when we start to freeze frameworks!

I  turn  now  to  the  challenge  more  specifically  aimed  at 
statisticians:  The 

“back side of omnicompetence”

-- which some might render, less delicately, as the
“back side of incompetence.”

I have long complained of the heresy/fallacy of omnicompetence,
of the claim that “just tell me exactly what arithmetic
has been done to the data, then I will know what your
results mean!” In other words: “If we look at a detailed
description of some arithmetic that can be applied to data, it is
easy to understand just what that arithmetic does!” As most
of you know: no one of us, here or elsewhere, can correctly
make such a claim. To understand just what a given set of
arithmetic does -- even so simple arithmetic as taking the
arithmetic mean -- is not easy (in that case, we were still
learning a hundred and fifty years after Gauss).


I have been slow to realize that omnicompetence has two
sides, the one just mentioned, and an even more threatening
“back side” which can be recognized/identified by these two
statements: “Don’t bother me with procedures that are not
simple and transparent!” (If they weren’t, I might not be
omnicompetent!) “I admit data analysis has to be inductive,
but understanding how to select ways to do it should be --
nay, must be -- deductive!” (Else I might not be
omnicompetent.) The back side is the more threatening one, since
its acceptance would keep us from polishing our methods to
make them work better, whenever that polishing would make
either arithmetic or natural heuristics more complicated, and
would confine us to methods whose advantages we could deduce,
thus ruling out, in particular, heuristically-based suggestions
whose performance has been validated by simulation.
It would mean tying our hands, keeping us far from doing as
well as we could -- in the interest of having a simple, relatively
teachable account of how our methods got that way.
How they got that way, NOT how they perform!

The  implicit  assumption  is  that  we  are  trying  to  teach 
how the then-current classic methods came about and can be
extended,  rather  than  teaching  how  they  behave  and  when 
to  use  them. 

Fred  Mosteller  and  I  have  worked  a  lot  together.  I  admire 
his  teaching  skills;  so  I  asked  myself  how  one  could  approach 
such  problems  in  a  way  that  might  satisfy  both  of  us.  One 
possibility  seems  to  be  a  two-layer  approach,  combining  a 
very  simple,  easy-to-follow  procedure  for  understanding  with 
a  warning  that  that  procedure  was  usually  not  good  enough 
and  you  ought  to  do  better,  as  by  using  the  following  canned 
program,  which  has  thus-and-so  properties.  When  I  tried  this 
on  Fred,  he  said  that  he  and  some  colleagues  were  already 
doing  something  close  to  this  in  a  non-statistical  field.  So 
maybe  it  is  a  good  idea. 

There was a time when it was felt to be the kiss of death
for a new statistics book to call it “a cookbook.” I am saying
that this is no longer reasonable or wise. Who among you
would like to throw out all cookbooks from your household,
and replace them by a text entitled “Principles of Culinary
Chemistry”?  Do  you  think  such  a  change  would  make  your 
meals  taste  better?  It’s  a  pretty  good  idea  to  stick  to  “Fannie 
Farmer”! 

Let us also, for a moment, compare ourselves with analytical
chemists -- a respectable profession of comfortable
antiquity. Analytical chemists live in a real world; they take
the occurrence of “interferences” seriously. They do not expect
to analyze for a single particular substance in the same
single way, when that substance occurs together with very
different sets of other substances. Analyses in one “matrix”
-- in urine, for example -- need not be like analyses in
another matrix, either in blood or in distilled water. Many
details of the analytical procedures are important; for most
of them there are heuristics, for some there are none. The
test of a good procedure is how well it works, not how well it
is understood. No one believes that, even from a highly detailed
knowledge of chemistry, it is today possible to deduce
the operations by which we should analyze for a particular
substance.

If  we  are  to  do  as  well  as  the  analytical  chemists,  we 
will  have  to  disavow  simplicity  and  deducibility  as  absolute 
standards. 

We can retain simplicity and deducibility as secondary
goals -- as ways of choosing among otherwise equally qualified
candidates; as reasons for slight modulations of very
good procedures, modulations keeping nearly all -- but perhaps
not all -- of the highest performance we can otherwise
get. (I would always give up 1% of efficiency for simplicity
-- and 0.4% for deducibility -- partly for the sake of the
procedure itself, but mainly because I would hope to do better
with analogs of the procedure in more general or more complicated
problems.) But we dare not accept simplicity and
deducibility as major goals.

As I look back over more than four decades of experience
with both practice and theory, I have to conclude that almost
every major step with which I have been concerned offers an
example or two of the back side of omnicompetence. You
may doubt this, so I shall talk briefly about four examples.

First, multiple comparisons. The first serious proposal
about multiple comparisons used the studentized range, for
interesting reasons that do not concern us here. Shortly
thereafter, F-based methods became temporarily popular,
perhaps because they seemed simpler, and more closely related
to previous practice. It took a long time before it was
generally recognized that they spent error rate where it either
wasn’t needed or couldn’t be effective.

It took another decade or so to realize that there were situations
where much more complicated multiple comparisons
procedures (like those of Ramsey, or even those of Peritz)
were needed, that there were several kinds and slices of multiple
comparisons problems, for each of which different procedures,
not necessarily simple ones, were appropriate (cp.
Braun and Tukey 1983c).

Spectrum analysis, of the kind once called modern, also
started for unexpected reasons. The great thing that made
its techniques fly was giving up the simple picture of energy
at a few isolated frequencies, which was altogether too simple,
and had led to procedures that were inadequate for -- and
misleading when applied to -- many real-world problems.

Omnicompetence has returned to spectrum analysis recently,
in the form of overbelief in maximum entropy estimates,
accompanied in some minds by ideas equivalent to “if
you’ve calculated some moments, any fit you take seriously
must match these moments exactly!” For those of us who
have thrown off the shackles of the arithmetic mean -- itself
an example of moment matching -- this attitude hardly seems
either reasonable or wise. (There are places where maximum
entropy estimates will serve us well, not for esoteric reasons,
but because they can be seen to perform well there.)

Next, let us turn to robust/resistant techniques. Here
there was a history of many decades of simple ideas: ideas
like rejection rules and trimmed means. The Princeton Robustness
Study (Andrews, et al 1977e), and its follow-on
waves, looked at several hundred individual estimates and
several thousand kinds of linear combinations of these estimates.
Out of this non-simplicity came the biweight (with a
tuning constant of about 7), for which there is still no deductive
approach, but for which experimental sampling -- both
in its “swindled” form as Monte Carlo and in its even more
sophisticated form as configural polysampling -- has shown
the high quality of its behavior.


Recently I have seen a further stage of the same other
side. A paper showing how to fit a straight line better --
where “better” was demonstrated by experimental sampling
-- in the face of very uncomfortably non-constant variance
revolted referees and editor because the procedure was “too
complicated.” If we are to do well in complicated problems,
we are almost sure to start with complicated solutions, which
we may later learn to simplify -- but probably only to a
limited extent.

Let me turn finally to clustering, where there are reputed
to be review articles with thousands of references -- almost
every one starting with some simple procedure, possibly deriving
some conceivably interesting possibilities, and taking
no thought at all for modification or fine tuning.

The implied attitude is, it seems to me, “we can always
invent the wheel (even if ours is square, we don’t have to
modify it)”!

Recently Katherine Hanson and I have been trying the
opposite approach: starting with a very simple test bed, and
a somewhat more plausible initial approach, how can we modify
and adjust and complicate the procedure until its performance
at least comes close to that of the human eye?

Let  me  show  you  a  couple  of  pictures,  pictures  that  are 
real  and  typical  for  a  special  situation,  but  possibly  somewhat 
misleading  because  tuning  for  a  wider  variety  of  situations 
may force us to degrade performance in this special case.

First  a  challenge,  a  picture  of  150  undistinguished  points. 
Some  of  you  may  be  able  to  see  3  concentrations,  some 
may  guess  that  this  is  a  mixture  of  samples,  of  nearly  equal 
size,  from  3  circular  Gaussian  populations,  populations  that 
clearly interpenetrate one another quite a little.


The challenge -- 150 undistinguished points

Now see what a carefully tuned, fully automatic procedure
-- which knows it is looking for 3 clusters, but makes no
explicit use of Gaussianity, or of isotropy, or of equal sample
size -- can do! Here we have distinguished the 3 samples by
using different plotting characters -- and also separated the
three clusters from one another -- so we can see both the
overlap of the samples and the performance of the technique.

The response: segmentation (artificially separated)
based on 150 undistinguished points and “find 3”

14 misclassifications, only a few more than for
population-based discriminants (8)

This last example suggests the eye-brain combination
is not simple. If we are to make AI -- here Automatic
Insight -- competitive with human processes, we cannot expect
to build effective AI on very simple rules and then trust to
millions or billions of CPU cycles to get the right answer.

Large parts of the AI profession are badly infested with
their version of the same “back side.” The combination of
very simple elements and very massive computer processing
is unlikely to provide either efficient or even effective support
for human purposes. We shall need to choose subtle basic
elements and to combine them creatively -- only then can
we make good use of all those CPU cycles.

If all my life I and a few others had taken the “back side
of omnicompetence” seriously, there would have been long
delays in appearance and use of effective multiple comparisons,
effective spectrum analysis, effective straight-line fitting
under very difficult circumstances, and, I hope, effective
clustering.


I urge each of you to purge yourself of the “back side of
omnicompetence” as thoroughly as you can!


1  Introduction

Although the problem of distributing computations
over networks of processors has received a
great deal of theoretical attention, only now, as
commercial systems have become available, are
the practical limitations of parallel algorithms
becoming apparent. In particular, communication
costs can render an otherwise attractive algorithm
unsatisfactory. It is therefore necessary to have
analyses of parallel algorithms that indicate
under what conditions they will perform effectively.

The purpose of this note is to describe and
analyze a parallel algorithm for computing the
QR factorization of an n × p matrix X (for
applications of this factorization see [1, Ch. 9]). The
algorithm is designed to run on a ring of r
processors that communicate by message passing. In
the next section, we will describe the sequential
algorithm and its numerical properties. The
parallel version of the algorithm will be described in
§3, and its analysis given in §4. The analysis
suggests some modifications of the algorithm, which
are sketched in §5.

* Department of Computer Science and Institute for Physical
Science and Technology. This work was supported
in part by the Air Force Office of Scientific Research
under Grant AFOSR-82-0078.

2  The Sequential Algorithm

Let X be an n × p matrix with n > p. Then X
can be factored in the form

$$X = QR, \tag{2.1}$$

where

$$Q^T Q = I \tag{2.2}$$

and R is upper triangular. From (2.1) and (2.2)
it follows that

$$A = X^T X = R^T R; \tag{2.3}$$

that  is,  R  is  the  Cholesky  factor  of  the  cross- 
product  matrix  A.  This  suggests  the  following 
three  step  algorithm  for  computing  Q  and  R. 

1.  Compute  A  from  (2.3). 

2.  Compute  the  Cholesky  factor  R  of  A. 

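3.  Partitioning X and Q by rows in the form

$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}, \qquad Q = \begin{pmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_n^T \end{pmatrix},$$

solve the systems

$$q_i^T R = x_i^T \quad (i = 1, 2, \ldots, n) \tag{2.4}$$

for the rows of Q.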
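
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not the author's code: matrices are lists of rows, the helper names (`gram`, `cholesky_upper`, `solve_row`, `cholesky_qr`) are hypothetical, and X is assumed to have full column rank so that the Cholesky step does not break down.

```python
import math

def gram(X):
    """Step 1: the cross-product matrix A = X^T X."""
    p = len(X[0])
    return [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]

def cholesky_upper(A):
    """Step 2: upper triangular R with R^T R = A (A assumed positive definite)."""
    p = len(A)
    R = [[0.0] * p for _ in range(p)]
    for i in range(p):
        R[i][i] = math.sqrt(A[i][i] - sum(R[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, p):
            R[i][j] = (A[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    return R

def solve_row(R, x):
    """Step 3 for one row: solve q^T R = x^T by forward substitution."""
    p = len(R)
    q = [0.0] * p
    for j in range(p):
        q[j] = (x[j] - sum(q[k] * R[k][j] for k in range(j))) / R[j][j]
    return q

def cholesky_qr(X):
    """Return (Q, R) with X = QR, computed by the three steps."""
    R = cholesky_upper(gram(X))
    Q = [solve_row(R, x) for x in X]
    return Q, R

# Small made-up example: n = 3 rows, p = 2 columns.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Q, R = cholesky_qr(X)
QtQ = gram(Q)                                     # should be close to I
QR = [[sum(q[k] * R[k][j] for k in range(2)) for j in range(2)] for q in Q]
print("Q^T Q =", QtQ)
print("Q R   =", QR)
```

Because each row of Q depends only on R and the matching row of X, step 3 needs no communication between rows, which is what the parallel version in §3 exploits.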

From a computational point of view, this algorithm
is quite satisfactory. The formation of
A, or rather its upper half, requires about $np^2/2$
floating-point additions and multiplications; the
formation of R requires $p^3/6$; and the solution of
the system (2.4) requires $np^2/2$. Thus for large n,
the entire algorithm requires about $np^2$ floating-point
additions and multiplications.

From a numerical point of view, the algorithm
is less satisfactory. On the positive side, because
the $q_i$'s are generated as solutions of (2.4), the
method has a backward rounding-error analysis.
Specifically, if the computations are performed in
t-digit decimal arithmetic, then there is a matrix
E of order $10^{-t}\|X\|$ such that

$$QR = X + E. \tag{2.5}$$

Thus little information about X is lost in the passage
to Q and R.


Figure 1: A Ring of Four Processors

However, the columns of Q may fail to be orthonormal; that
is, (2.2) may fail to hold, even approximately. If the
columns of X are scaled so that they have norm one, then
this phenomenon occurs precisely when R^{-1} is large; for
in this case the system (2.4) will be ill conditioned, and Q
will be inaccurately determined. This means that we can at
least recognize the problem when it occurs by applying a
condition estimator to R [1, Ch. 1].

Moreover, the problem admits a fix. For if we apply the
algorithm again to Q, we will obtain matrices P and S, with
S upper triangular, such that

$PS = Q + F.$ (2.6)


If (2.6) and (2.5) are combined, the result is

$PSR = QR + FR = X + E + FR.$

Thus the factorization P(SR) also reproduces X. Usually P
will have columns orthogonal to working accuracy. If not,
the reorthogonalization can be repeated.


3  Parallel  Implementation 


The  algorithm  sketched  above  has  a  natural  im¬ 
plementation  on  a  ring  consisting  of  r  processors. 
Such  a  ring  is  shown  for  r  =  4  in  Figure  1.  Here 
we  shall  suppose  that  communication  is  clockwise, 
as  indicated  by  the  arrows. 


    A_i = X_i^T X_i
    A = 0
    for j = 1 to r loop
        A = A + A_i
        send(A_i)
        receive(A_i)
    end loop
    compute R
    solve Q_i R = X_i


Figure 2: Program for Processor i


The idea of the implementation is simple. The matrix is
partitioned in the form

$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_r \end{pmatrix},$ (3.1)

where each block has roughly

$m = n/r$

rows, and block X_i is assigned to processor i. Each
processor initially computes

$A_i = X_i^T X_i.$

The A_i are then circulated around the ring. As they pass,
the processors add them together, so that at the end each
processor has a copy of the cross product matrix
$A = \sum_{i=1}^{r} A_i$. It is then a simple matter for each
processor to compute R and solve the systems (2.4) to form
$Q_i = X_i R^{-1}$, where Q_i is the ith block of Q in a
partitioning conformal with (3.1).

Figure 2 contains a program implementing this algorithm. In
it the array A_i is used both to hold the initial A_i
computed by the processor and to hold the A_i's from other
processors as they circulate around the ring. The function
send sends a block of data (in this case whatever is
currently in A_i) to the next processor. The function
receive informs the system where to place incoming data.

An attractive feature of this algorithm is that each
processor ends up with a copy of R, from which it can check
whether a reorthogonalization step is necessary. Since each
processor will reach the same conclusion, the
reorthogonalization can start forthwith without any initial
communication among the processors.
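The circulation of the blocks around the ring can be simulated sequentially. The sketch below is a toy model, not the author's program: plain numbers stand in for the p × p blocks, and the name ring_sum is hypothetical. It only checks that after r add-and-pass steps every processor holds the full sum.

```python
def ring_sum(local_blocks):
    """Simulate the accumulate-and-pass loop on a ring of r processors.

    local_blocks[i] plays the role of the block computed on processor i;
    numbers are used here in place of matrices.
    """
    r = len(local_blocks)
    buffers = list(local_blocks)   # what each processor currently holds in its buffer
    totals = [0.0] * r             # A = 0 on every processor
    for _ in range(r):
        # every processor adds the block currently in its buffer
        totals = [totals[i] + buffers[i] for i in range(r)]
        # send to the next processor, receive from the previous one
        buffers = [buffers[(i - 1) % r] for i in range(r)]
    return totals
```

For example, ring_sum([1.0, 2.0, 3.0, 4.0]) leaves the value 10.0 on each of the four simulated processors.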

4  Analysis  of  the  Algorithm 

We now turn to the analysis of the parallel algorithm. The
time required to complete the algorithm may be divided into
two parts: the time devoted to computing and the time
devoted to communication. Let us look at the computing time
first.

We shall assume that it requires time α to perform an
addition and a multiplication in the program of Figure 2. As
is customary in this kind of analysis, α includes all the
indexing and looping overhead in the algorithm, so that it
will be considerably greater than the time for the bare
arithmetic operations.

The costs of the various computations are summarized below.

    X_i^T X_i    (1/2) m p^2 α
    Summing      (1/2) r p^2 α
    R            (1/6) p^3 α
    Q_i          (1/2) m p^2 α

Since X_i^T X_i is symmetric, we need only compute half of
it, which accounts for the factor of 1/2 in the first item.
Summing these items and making the substitution m = n/r, we
obtain the total computing time

$T_a = \left( \frac{np^2}{r} + \frac{1}{2} r p^2 + \frac{1}{6} p^3 \right) \alpha.$ (4.1)

For the communication time, we shall assume that the
send-receive sequence in the program requires a fixed setup
time σ which is independent of the length of the message.
Thereafter, data is transmitted at a rate of τ^{-1} items
per unit time. Since only half of each A_i is transmitted,
the total communication time will be

$T_c = r\sigma + \frac{1}{2} r p^2 \tau.$ (4.2)

Looking at (4.1) and (4.2), we see that their contributions
can be divided into two parts. The first is the term
np^2 α/r, which decreases with the number of processors.
Since np^2 α is the time taken by the sequential algorithm,
we could hardly expect greater speedup.

The second part consists of the other terms, which are
linear in r and must ultimately dominate the first part.
Although we may initially see a decrease in time as we add
processors, we must ultimately come to a point where adding
processors actually increases the total time. This point can
be computed by adding (4.1) and (4.2), differentiating,
setting the result to zero, and solving for r. The result is

$r = \sqrt{\frac{2n}{1 + \bar\tau + 2\bar\sigma/p^2}},$ (4.3)

where we have introduced the relative parameters

$\bar\sigma = \frac{\sigma}{\alpha}, \qquad \bar\tau = \frac{\tau}{\alpha}.$

There are several things to observe about this expression.
The number √(2n) is an upper bound on the number of
processors we can profitably use. It comes from the fact
that we must sum the A_i's on each processor, and has
nothing to do with communication time. However, poor
communication can certainly make things worse. For if the
time taken to transmit numbers is greater than the time to
perform arithmetic operations, the relative transmission
time τ/α will be greater than one, and will dominate in
(4.3). If p is not too small, the relative setup time σ/α
will play a smaller part; however, on some existing
computers with a large startup time even this term can
dominate.
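The trade-off can be explored numerically. The sketch below assumes the cost model reads T_a = (np^2/r + rp^2/2 + p^3/6)α for computing and T_c = rσ + rp^2 τ/2 for communication, with α the time per addition-multiplication pair, σ the setup time and τ the time per transmitted item; under that assumption the minimizing r is sqrt(2n / (1 + τ/α + 2σ/(αp^2))). The function names are hypothetical.

```python
import math

def total_time(n, p, r, alpha, sigma, tau):
    """Computing time plus communication time under the assumed cost model."""
    compute = (n * p**2 / r + r * p**2 / 2 + p**3 / 6) * alpha
    communicate = r * sigma + r * p**2 * tau / 2
    return compute + communicate

def optimal_processors(n, p, alpha, sigma, tau):
    """Real-valued r that minimizes total_time, found by setting dT/dr = 0."""
    sigma_rel = sigma / alpha   # relative setup time
    tau_rel = tau / alpha       # relative transmission time
    return math.sqrt(2 * n / (1 + tau_rel + 2 * sigma_rel / p**2))
```

With free communication (sigma = tau = 0) the function returns sqrt(2n), the upper bound mentioned in the text; in practice one would round the result to a nearby integer and compare total_time at the neighbors.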

5  Revising  the  Algorithm 

The importance of an analysis like the one in the last
section is that it can suggest how to modify an algorithm to
make it run better, and it can indicate features that it is
desirable to have in a parallel computer. The algorithm for
the QR factorization is a case in point. The dominant terms
come from passing the A_i around the ring and from summing
them on the processors. Let us see what we can do about
these two roadblocks, beginning with the latter.

Although we cannot calculate A entirely in parallel, we can
reduce the amount of computation by a pairwise summation
algorithm. Specifically, the odd numbered processors send
their A_i's to the next even numbered processors, which add
them to their own A_i's. Then the even numbered processors
pair off to do another pairwise summation, and so on until
there is only one processor left, which will of course
contain A. An example for r = 7 is given in Figure 3. From
this it is seen that pairwise summation reduces the time
from r to approximately log_2 r.

Figure 3: Pairwise Summation

However, we still have a communications problem; for at the
last step we must pass a block of p^2 numbers a distance of
about r/2 processors. We can do nothing about this unless we
are willing to assume something further about how the
processors communicate. One possibility is that the
processors can link up and pass numbers bucket brigade
fashion from the source processor to the destination
processor. If the number of processors in the brigade is d,
then the operation will take time

$(d + p^2 - 1)\beta,$ (5.1)

where β is the time to transmit a single number in brigade
mode. Although we have not gotten rid of the dependence on
r, we have reduced its influence by a factor of p^2.

We have been deliberately vague about the details of this
algorithm in the hope that the reader will undertake the
rewarding task of fleshing it out. The program will be
considerably more complicated, as will be the analysis.
However, at the end you will have formulas by which you can
compare the two algorithms for yourself.
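The pairwise summation idea can be tried out with a small sequential simulation. This is a sketch under simplifying assumptions: numbers stand in for the blocks held by the processors, the input list is non-empty, communication is not modeled, and the name pairwise_sum is hypothetical.

```python
def pairwise_sum(blocks):
    """Tree-style reduction; returns (total, number of parallel steps)."""
    active = list(blocks)
    steps = 0
    while len(active) > 1:
        merged = []
        # neighbors pair off; the first of each pair sends to the second, which adds
        for k in range(0, len(active) - 1, 2):
            merged.append(active[k] + active[k + 1])
        if len(active) % 2 == 1:
            merged.append(active[-1])   # an unpaired processor simply waits
        active = merged
        steps += 1
    return active[0], steps
```

For seven processors the reduction finishes in three parallel steps, in line with the estimate of roughly log_2 r.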
MATRIX COMPUTATIONS, SIGNAL PROCESSING AND SYSTOLIC ARRAYS


Franklin  T.  Luk,  Cornell  University 


Abstract 


Parallel  matrix  computing  has  become  an  essen¬ 
tial  part  of  real-time  signal  processing.  Systolic 
arrays  and  associated  algorithms  for  computing  the 
symmetric  eigenvalue  decomposition,  the  singular 
value  decomposition  and  the  generalized  singular 
value  decomposition  are  described  in  detail. 


Numerical linear algebra is an important tool for modern
signal processing practitioners, who must solve systems of
linear equations, compute eigenvalues, eigenvectors,
singular values and singular vectors (cf. Bromley and
Speiser [1]). The necessity that these matrix operations be
completed in real time, together with the availability of
VLSI/VHSIC technology, has led to the development of special
purpose multiprocessor systolic arrays. In this paper we
discuss systolic arrays and their associated parallel
algorithms for computing the symmetric eigenvalue
decomposition, the singular value decomposition and the
generalized singular value decomposition.

A given symmetric n × n matrix A can be diagonalized via a
similarity transformation:

$A = V \Sigma V^T,$ (1)

where the matrix V is n × n orthogonal and Σ is n × n
diagonal. The eigenvalue decomposition (EVD) is extensible
to the diagonalization of an m × n (m ≥ n) matrix A. Two
different transformations are required to compute the
singular value decomposition (SVD):

$A = U \Sigma V^T,$ (2)

where the matrices U (m × m) and V (n × n) are orthogonal,
and the matrix Σ (m × n) is nonnegative diagonal. For
applications and computations of these decompositions see
Golub and Van Loan [2] and Dongarra et al. [3]. The SVD can
be extended to a simultaneous diagonalization of two real
matrices A (m × n) and B (p × n) by two orthogonal matrices
U (m × m) and V (p × p) and a nonsingular matrix X (n × n):

$U^T A X = D_A \equiv \mathrm{diag}(\alpha_i),$ (3)

$V^T B X = D_B \equiv \mathrm{diag}(\beta_i).$ (4)

The factorization (3-4) is called a generalized singular
value decomposition (GSVD). It is useful for solving various
constrained and generalized least squares problems, e.g.,

$\|Ax - b\|_2 = \min$

subject to

$\|Bx - d\|_2 \le \epsilon.$


Many systolic arrays have been proposed for these matrix
decompositions. Brent and Luk [4] presented a systolic array
for computing the symmetric eigenvalue decomposition. SVD
arrays are given in Brent and Luk [4], Brent, Luk and Van
Loan [5], Finn, Luk and Pottle [6], Heller and Ipsen [7],
Luk [8] and Schimmel and Luk [9], and GSVD arrays in Brent,
Luk and Van Loan [10] and Luk [11].

The most effective parallel eigenvalue and singular value
algorithms for full, dense matrices are of the Jacobi type.
Jacobi techniques are easily implementable on mesh-connected
processors. Indeed, they were used for finding eigenvalues
and singular values on the ILLIAC IV, the first parallel
computer (see Luk [12] and Sameh [13]). To compute an n × n
SVD, a parallel Jacobi scheme [5, 8] requires n^2 processors
and O(nS) time, where S denotes the number of sweeps for
convergence. The parameter S is a slowly growing function of
n and is conjectured to equal O(log n) [4]. In comparison,
the LINPACK [3] SVD procedure requires time O(n^3).
Unfortunately, Jacobi-SVD algorithms are applicable only to
square matrices. For a rectangular matrix A, an obvious
strategy is to first compute the QR decomposition (QRD) of
A:

$A = Q \begin{pmatrix} R \\ 0 \end{pmatrix},$ (5)

where the matrix Q is m × m orthogonal and R is n × n upper
triangular, and to then apply the SVD procedure to the
square matrix R. This approach is particularly suitable for
the case where m ≫ n (cf. Chan [14]). However, we need to
handle the interfacing of different arrays. To alleviate the
problem, Luk [8] suggested one “triangular” processor array
for computing both the QRD and the SVD. Subsequently, a new
GSVD algorithm implementable on the same array was proposed
by Luk [11].


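The basic tool in a Jacobi method is the 2 × 2 plane
rotation

$J(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},$ (6)

as the basic problem concerns the diagonalization of a 2 × 2
matrix by the rotation:

$J(\theta)^T \begin{pmatrix} p & q \\ q & r \end{pmatrix} J(\theta) = \begin{pmatrix} d_1 & 0 \\ 0 & d_2 \end{pmatrix}.$ (7)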


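Suppose q ≠ 0 (else choose either θ = 0 or θ = π/2). It is
well known that t = tan θ satisfies the quadratic equation:

$t^2 + 2\rho t - 1 = 0,$ (8)

where

$\rho = \frac{r - p}{2q} = \cot 2\theta.$ (9)

The two solutions to (8) are

$t = \frac{\mathrm{sign}(\rho)}{|\rho| + \sqrt{1 + \rho^2}}, \quad \cos\theta = \frac{1}{\sqrt{1 + t^2}}, \quad \sin\theta = t \cos\theta,$ (10)

and

$t = -\mathrm{sign}(\rho) \left[\, |\rho| + \sqrt{1 + \rho^2} \,\right], \quad \cos\theta = \frac{1}{\sqrt{1 + t^2}}, \quad \sin\theta = t \cos\theta.$ (11)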

The angle θ associated with (10) is the smaller of the two
possibilities; it satisfies 0 ≤ |θ| < π/4, whereas the one
associated with (11) satisfies π/4 ≤ |θ| < π/2. We refer to
a rotation through the smaller angle as an “inner rotation”
and one through the larger angle as an “outer rotation” (cf.
Stewart [15]). The “inner rotation” is chosen in Brent et
al. [4, 5] and the “outer rotation” in Luk [8]. If the given
matrix is diagonal (q = 0) then an “inner rotation” means
θ = 0 and an “outer rotation” implies θ = π/2. In the former
case the matrix stays unchanged, whereas in the latter case
the eigenvalues are interchanged:

$\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} p & 0 \\ 0 & r \end{pmatrix} \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} = \begin{pmatrix} r & 0 \\ 0 & p \end{pmatrix}.$
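The choice between the two rotations can be illustrated in a few lines of Python. The sketch assumes the rotation has the form [[c, s], [-s, c]] with c = cos θ and s = sin θ, and that ρ = (r - p)/(2q); the function names are hypothetical.

```python
import math

def jacobi_rotation(p, q, r, outer=False):
    """Return (c, s) so that J = [[c, s], [-s, c]] diagonalizes [[p, q], [q, r]]."""
    if q == 0.0:
        # already diagonal: theta = pi/2 for an outer rotation, theta = 0 otherwise
        return (0.0, 1.0) if outer else (1.0, 0.0)
    rho = (r - p) / (2.0 * q)
    sgn = 1.0 if rho >= 0.0 else -1.0
    root = abs(rho) + math.sqrt(1.0 + rho * rho)
    t = -sgn * root if outer else sgn / root   # tan(theta)
    c = 1.0 / math.sqrt(1.0 + t * t)
    return c, t * c

def apply_rotation(p, q, r, c, s):
    """Return (d1, off, d2), the entries of J^T [[p, q], [q, r]] J."""
    d1 = c * c * p - 2.0 * c * s * q + s * s * r
    off = c * s * (p - r) + (c * c - s * s) * q
    d2 = s * s * p + 2.0 * c * s * q + c * c * r
    return d1, off, d2
```

In both cases the off-diagonal entry returned by apply_rotation should vanish to rounding error; with q = 0 and outer=True the two diagonal entries are simply interchanged.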

By solving an appropriate sequence of 2 × 2 EVD problems, we
compute an EVD of a general n × n matrix A. The Jacobi
transformation is

$T_{ij}: A \leftarrow J_{ij}^T A J_{ij},$ (12)

where J_ij is a rotation in the (i,j) plane chosen to
annihilate the (i,j) and (j,i) elements of A. The
transformation T_ij will produce a matrix Ã satisfying

$\mathrm{off}(\tilde{A}) = \mathrm{off}(A) - 2a_{ij}^2,$ (13)

where off(A) denotes the sum of squares of the off-diagonal
elements of A; i.e., the matrix Ã is more “diagonal” than A.
The value of (i,j) is determined according to some ordering,
to be determined such that all the off-diagonal elements
will be annihilated once in any group of n(n-1)/2 rotations
(called a “sweep”). A well known example is the
cyclic-by-rows ordering, illustrated here in the n = 4 case:

(i,j) = (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).

Jacobi methods lend themselves to parallel computations.
Brent and Luk [4] developed a square processor array and a
“parallel” ordering that allows ⌊n/2⌋ simultaneous
rotations. Their new ordering is amply illustrated by the
n = 8 case:

(i,j) = (1,2), (3,4), (5,6), (7,8),
        (1,4), (2,6), (3,8), (5,7),
        (1,6), (4,8), (2,7), (3,5),
        (1,8), (6,7), (4,5), (2,3),
        (1,7), (8,5), (6,3), (4,2),
        (1,5), (7,3), (8,2), (6,4),
        (1,3), (5,2), (7,4), (8,6).

Rotation pairs associated with each “row” of the above
ordering can be calculated concurrently. We present a
parallel Jacobi algorithm for A.

Algorithm EVD.
do until convergence
    for each (i,j) according to the “parallel” ordering
        A ← J_ij^T A J_ij. □

By convergence we mean that the parameter off(A) has fallen
below some pre-selected tolerance. However, it is difficult
to monitor off(A) in the settings of parallel computations.
Since convergence is fast (ultimately quadratic) it is a
usual practice to stop iterations after a sufficiently large
number (say ten) of sweeps. Details on the processor array
are given in Brent and Luk [4]. Important points worth
emphasizing are that only nearest neighbor connections are
required, that broadcasting can be avoided through a
staggering of computations, and that one sweep of the
algorithm is implementable in time O(n).
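A sequential version of the Jacobi iteration is easy to write down and is useful for experimenting with the number of sweeps. The sketch below is not the systolic algorithm: it uses the cyclic-by-rows ordering rather than the “parallel” ordering, inner rotations, a fixed number of sweeps, and a symmetric matrix stored as a list of rows. The function names are hypothetical.

```python
import math

def off(A):
    """Sum of squares of the off-diagonal elements of A."""
    n = len(A)
    return sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)

def rotate(A, i, j):
    """Overwrite A with J^T A J, where J annihilates A[i][j] and A[j][i]."""
    if A[i][j] == 0.0:
        return
    rho = (A[j][j] - A[i][i]) / (2.0 * A[i][j])
    t = math.copysign(1.0, rho) / (abs(rho) + math.sqrt(1.0 + rho * rho))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = t * c
    for k in range(len(A)):          # A <- A J (columns i and j)
        aki, akj = A[k][i], A[k][j]
        A[k][i] = c * aki - s * akj
        A[k][j] = s * aki + c * akj
    for k in range(len(A)):          # A <- J^T A (rows i and j)
        aik, ajk = A[i][k], A[j][k]
        A[i][k] = c * aik - s * ajk
        A[j][k] = s * aik + c * ajk

def jacobi_evd(A, sweeps=10):
    """Return a nearly diagonal matrix orthogonally similar to symmetric A."""
    A = [row[:] for row in A]
    n = len(A)
    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                rotate(A, i, j)
    return A
```

Printing off(A) after each sweep on a small random symmetric matrix is a quick way to observe the fast decrease that motivates stopping after a fixed number of sweeps.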

Numerical experiments were performed on a VAX-11/780 at
Cornell University. Double floating data types were used:
each number is binary normalized, with an 8-bit signed
exponent and a 57-bit signed fraction whose most significant
bit is not represented. The accuracy is thus approximately
17 decimal digits. The results are presented in Table 1. We
started with random n × n matrices whose elements came from
a uniform distribution in the interval (-1, 1); we stopped
when the parameter off(A) had been reduced to 10^{-12} times
its original value. The rate of convergence was quadratic,
confirming theoretical predictions, and only eight or fewer
sweeps were required for n ≤ 200. Empirically we find that
S = O(log n), and there are theoretical reasons for
believing this, although it has not been proved rigorously.
In practice S can be regarded as a constant (say 10) for all
realistic values of n (say n ≤ 1000).


Table 1. Average Number of Sweeps Required by Algorithm EVD

      n    trials    # sweeps
      4     5000       2.64
      6     5000       3.37
      8     2000       3.79
     10     2000       4.09
     20     1000       4.94
     30     1000       5.41
     40     1000       5.74
     50     1000       5.99
    100      500       6.78

One  may  expect  that  software  (or  hardware) 
for  the  symmetric  eigenvalue  problem  can  be  used 
to  solve  the  SVD  problem.  For  example,  we  may 
compute  an  eigenvalue  decomposition  of  the  matrix 


However, Brent et al. [5] gave detailed explanations why we
should not approach the SVD problem as a symmetric
eigenvalue problem.


Singular  Value  Decomposition 

The basic problem here concerns the diagonalization of a
2 × 2 matrix by the two rotations J(θ) and K(φ):

$K(\phi)^T \begin{pmatrix} w & x \\ y & z \end{pmatrix} J(\theta) = \begin{pmatrix} d_1 & 0 \\ 0 & d_2 \end{pmatrix}.$ (14)


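A two-stage procedure is adopted. First, find a rotation
S(ψ) to symmetrize the matrix:

$S(\psi)^T \begin{pmatrix} w & x \\ y & z \end{pmatrix} = \begin{pmatrix} p & q \\ q & r \end{pmatrix}.$ (15)

If x = y we choose ψ = 0, otherwise we compute

$\rho = \frac{w + z}{x - y} = \cot\psi, \quad \sin\psi = \frac{\mathrm{sign}(\rho)}{\sqrt{1 + \rho^2}}, \quad \cos\psi = \rho \sin\psi.$ (16)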


Second, diagonalize the resulting symmetric matrix:

$J(\theta)^T \begin{pmatrix} p & q \\ q & r \end{pmatrix} J(\theta) = \begin{pmatrix} d_1 & 0 \\ 0 & d_2 \end{pmatrix}.$ (17)

Finally, K(φ) is given by

$K(\phi)^T = J(\theta)^T S(\psi)^T,$ (18)

i.e., φ = θ + ψ. Again, by solving an appropriate sequence
of 2 × 2 SVD problems, we compute an SVD of a general n × n
matrix A. The Jacobi transformation is

$T_{ij}: A \leftarrow J_{ij}^T A K_{ij},$ (19)

where J_ij and K_ij are rotations in the (i,j) plane chosen
to annihilate the (i,j) and (j,i) elements of A. The
transformation T_ij will produce a matrix Ã satisfying

$\mathrm{off}(\tilde{A}) = \mathrm{off}(A) - a_{ij}^2 - a_{ji}^2,$ (20)

i.e., the matrix Ã is more “diagonal” than A.
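The two-stage 2 × 2 procedure (symmetrize, then diagonalize) can be checked numerically with a short Python sketch. It assumes rotations of the form [[c, s], [-s, c]], uses the inner rotation in the second stage, and returns diagonal entries that may be negative, so they agree with the singular values only up to sign. The name svd_2x2 is hypothetical.

```python
import math

def svd_2x2(w, x, y, z):
    """Diagonalize [[w, x], [y, z]] as K^T M J; returns ((ck, sk), (c, s), (d1, d2))."""
    # Stage 1: rotation S(psi) such that S^T M is symmetric.
    if x == y:
        cs, ss = 1.0, 0.0
    else:
        rho = (w + z) / (x - y)
        ss = math.copysign(1.0, rho) / math.sqrt(1.0 + rho * rho)
        cs = rho * ss
    p = cs * w - ss * y
    q = cs * x - ss * z          # equals ss * w + cs * y by construction
    r = ss * x + cs * z
    # Stage 2: rotation J(theta) that diagonalizes the symmetric matrix.
    if q == 0.0:
        c, s = 1.0, 0.0
    else:
        rho2 = (r - p) / (2.0 * q)
        t = math.copysign(1.0, rho2) / (abs(rho2) + math.sqrt(1.0 + rho2 * rho2))
        c = 1.0 / math.sqrt(1.0 + t * t)
        s = t * c
    d1 = c * c * p - 2.0 * c * s * q + s * s * r
    d2 = s * s * p + 2.0 * c * s * q + c * c * r
    # K = S J is the rotation through phi = theta + psi.
    ck = cs * c - ss * s
    sk = ss * c + cs * s
    return (ck, sk), (c, s), (d1, d2)
```

Because both factors are rotations, the product d1 * d2 equals the determinant of the input matrix and d1^2 + d2^2 equals the sum of squares of its entries, which gives two cheap sanity checks.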

Luk [8] proposed a triangular processor array that directly
computes an SVD of a rectangular matrix. The associated SVD
algorithm has two stages. First, a QR decomposition of A is
computed as it is fed into the array. Second, a Jacobi-SVD
algorithm is applied to the resultant triangular matrix. The
pivot block is restricted to contiguous diagonal elements,
so as to preserve the triangular structure of the matrix.
This so-called “odd-even” ordering is well illustrated by
the n = 8 case:

(i,j)  =  (1,2),  (3,4),  (5,6),  (7,8), 

1,  (2,3),  (4,5),  (6,7),  8. 


“Outer rotations” are required to ensure that all
off-diagonal elements will be annihilated. Details on the
array are presented in Luk [8]. Again the important points
concern the nearest neighbor connections, the avoidance of
broadcast, and the completion of a sweep in O(n) time. We
present here the associated SVD algorithm for an n × n upper
triangular matrix A:

Algorithm SVD.
do until convergence
begin
    { “outer rotations” are required }
    for i = 1, 3, ... (i odd), 2, 4, ... (i even) do
        A ← J_{i,i+1}^T A K_{i,i+1}
end. □

Simulation experiments similar to the ones described in the
previous section were performed at Cornell. The only
difference was that the initial matrices were upper
triangular. The results are presented in Table 2, where we
observe that S = O(log n).

Table 2. Average Number of Sweeps Required by Algorithm SVD

      n    trials    # sweeps
      4     1000       2.97
      6     1000       3.76
      8     1000       4.21
     10     1000       4.55
     20      100       5.54
     30      100       6.09
     40      100       6.40
     50      100       6.72
    100       10       7.56


Generalized  Singular  Value  Decomposition 

We now present a parallel GSVD algorithm. Only the simple
case where A and B are both square (n × n) and B is
nonsingular is considered. The first direct GSVD procedure
was given by Paige [16]. It implicitly applies a Jacobi-SVD
algorithm to the matrix C = AB^{-1}, and is numerically
appealing in that only orthogonal transformations are
applied to A and B and that the matrices B^{-1} and C are
never explicitly formed. We assume that both matrices A and
B are upper triangular (do two preparatory QR decompositions
if necessary). Orthogonal transformations U, V and Q are to
be determined so that the two resulting matrices U^T A Q and
V^T B Q have parallel rows, i.e.,

$U^T A Q = D \cdot V^T B Q,$ (21)

where D is some diagonal matrix. Defining the nonsingular
matrix X = B^{-1} V, we get the desired GSVD:

$V^T B X = I,$

$U^T A X = U^T A Q Q^T X = D V^T B Q Q^T B^{-1} V = D.$

On the other hand, note that

$U^T (A B^{-1}) V = D.$ (22)

So the transformations U and V can be obtained via an SVD
procedure applied to C. The gist of Paige's method lies in
its implicit computation of an SVD of AB^{-1} without
explicitly forming the matrices B^{-1} and C.

Luk [11] modified Paige's algorithm for parallel
computations by adopting the “odd-even” ordering. A big
advantage is that the upper triangular structures of both A
and B can be preserved. Now, if both A and B are upper
triangular, then so are the matrices B^{-1} and C. As such,
the two satisfy these special relations:

$(B^{-1})^{i,i+1} = (B^{i,i+1})^{-1}, \qquad C^{i,i+1} = A^{i,i+1} (B^{-1})^{i,i+1},$ (23)

where M^{i,i+1} denotes the 2 × 2 submatrix of M formed by
intersecting its ith and (i+1)st rows and columns. We have
thus proved

$C^{i,i+1} = A^{i,i+1} (B^{i,i+1})^{-1},$ (24)

the key condition for an implicit application of Algorithm
SVD to the upper triangular matrix C. We find rotations Y
and Z for a 2 × 2 SVD:

$Y^T C^{i,i+1} Z = S,$ (25)

where S is diagonal. Then

$Y^T A^{i,i+1} = S \cdot Z^T B^{i,i+1},$ (26)

i.e., the two rows of Y^T A^{i,i+1} and Z^T B^{i,i+1} are
parallel. We can thus find one rotation W to triangularize
both matrices (cf. Paige [16]).

How do these transformations affect the two n × n upper
triangular matrices A and B? We have

$A \leftarrow U_{i,i+1}^T A \, Q_{i,i+1}, \qquad B \leftarrow V_{i,i+1}^T B \, Q_{i,i+1},$ (27)

where U_{i,i+1}, V_{i,i+1} and Q_{i,i+1} denote appropriate
n × n rotations in the (i,i+1)-plane. Note that both
matrices U_{i,i+1}^T A and V_{i,i+1}^T B have only one
non-zero subdiagonal element each, in the (i+1,i)-position.
These two extraneous elements are annihilated by the same
rotation Q_{i,i+1}, which restores both A and B to
triangular form. Here is a parallel GSVD algorithm [11] for
upper triangular A and B:

Algorithm GSVD.
do until convergence
    for i = 1, 3, ... (i odd), 2, 4, ... (i even) do
    begin
        { U_{i,i+1} and V_{i,i+1} are “outer rotations” }
        determine U_{i,i+1} and V_{i,i+1} to
            annihilate c_{i,i+1} and c_{i+1,i};
        A ← U_{i,i+1}^T A;  B ← V_{i,i+1}^T B;
        find Q_{i,i+1} to zero out a_{i+1,i} and b_{i+1,i};
        A ← A Q_{i,i+1};  B ← B Q_{i,i+1}
    end. □

By convergence we mean that the rows of A and B become
parallel according to some predetermined measure. Algorithm
GSVD is easily implementable on the triangular QRD-SVD array
of Luk [8]. We compute initial QR decompositions of both A
and B as they are fed into the array. The SVD of C^{i,i+1}
and the triangularization of both A^{i,i+1} and B^{i,i+1}
are performed in parallel on the processor array in a
straightforward manner [8]. The significant fact is that a
GSVD can be computed in time O(nS).

Addenda

Real time signal processing is an exciting, new research
area. We have described two different processor arrays for
finding eigenvalues and singular values. There are plenty of
open problems that await satisfactory solution. Two
important examples are data partitioning (cf. Brent et al.
[5] and Schreiber [17]) and fault tolerance (cf. Huang and
Abraham [18], Jou and Abraham [19], Luk [20] and Luk and
Park [21]).
PARALLEL ARCHITECTURE

A TUTORIAL FOR STATISTICIANS
William F. Eddy, Carnegie-Mellon University


1.  INTRODUCTION  AND  SUMMARY 

For  a  number  of  years  computer  science  research 
has  studied  the  notions  of  parallel  processing  from 
both  a  hardware  and  a  software  point  of  view. 
Much  of  this  research  has  had  little  practical  impact 
up  to  now.  However,  this  situation  is  rapidly 
changing  and  the  next  few  years  will  make  systems 
which  incorporate  the  notions  of  parallel  processing 
much  more  widely  available  to  statisticians.  The 
Department  of  Statistics  at  Carnegie-Mellon 
University  has  experience  using  vector  processors, 
has  an  attached  processor  on  one  of  the  department 
VAXes,  and  has  experience  using  a  network  of 
processors  in  a  data-flow  system.  This  experience 
has  led  us  to  believe  that  the  time  is  ripe  for 
statisticians  to  begin  a  major  use  of  parallel 
computation. 

The  purpose  of  this  article  is  to  provide  a  brief 
introduction  to  the  various  notions  of  parallelism 
and  their  realization  in  various  hardware 
architectures.  Roughly  speaking  the  article  is 
divided  into  two  broad  parts:  the  first  is  an 
introduction  to  the  terminology  and  programming 
notions  needed  in  concurrent  programming;  the 
second  is  a  short  review  of  the  particular  classes  of 
parallel  architecture  that  we  have  found  useful. 

The  sections  on  concurrent  programming  will 
introduce  the  two  key  notions  that  are  necessary  for 
an  understanding  of  the  parallel  execution  of 
programs: 

•  Interprocess  Communication 

•  Process  Synchronization. 

The sections on hardware architectures will discuss the
three broad classes of machines which we have used and which
we believe can be generally useful for statisticians:

•  Vector  Processors 

•  Attached  Processors 

•  Networks  of  Processors. 

In the references a number of papers are listed that are not
explicitly referred to in the text; we have found these
papers very helpful in the organization of our thinking
about parallel computation.


2. SPECIFICATION OF PARALLELISM

In order to think clearly about the specification of
parallelism it is necessary to remember the fundamental
concepts of sequential programming. The single most
important concept, for our purposes here, is the notion of a
sequential process. A sequential process is the actual
execution of a sequential program; a sequential program
specifies the sequential execution of a list of statements.

A  concurrent  program  specifies  the  execution  of 
more  than  one  sequential  program  that  can  be 
executed  as  parallel  or  concurrent  processes.  These 


sequential  processes,  executing  in  parallel,  do  not 
necessarily  have  to  reside  in  separate  memories  nor 
do  they  have  to  be  executed  on  distinct  processors. 
Thus,  parallel  processes  can  be  implemented  on  real 
machines  in  one  of  three  ways. 

•  The processes can be multiprogrammed,
so that they share the memory of a
single processor. This is, of course,
precisely what a time-sharing system
does.

•  The  processes  can  share  a  single  memory 
but  be  run  on  separate  processors.  This 
is  usually  referred  to  as  multiprocessing 
and  obviously  requires  specially-designed 
hardware. 

•  The  processes  can  be  distributed  to  a 
number  of  individual  processors,  each 
having  its  own  memory,  connected  by  a 
communications  network. 

As mentioned above, the two basic problems that concurrent
programs (or rather the programmers who create the programs)
face are process synchronization and interprocess
communication. Execution of a concurrent program can be
represented by an acyclic directed graph where each node
represents a process and each directed arc indicates that
the process at its end node cannot execute until the process
at its source node has completed; this graph determines the
process synchronization and is called a process flow graph.
Communication between concurrent processes can be
represented by an acyclic directed graph where each node
represents a read or write of data by a process and a
directed arc indicates the transfer of data from the process
at its source node to the process at its end node; this
graph determines the interprocess communication and is
called a data flow graph. The graph is made acyclic by
making multiple copies of processes which send and receive
data more than once.


2.1. Coroutines

One  of  the  earliest  notions  for  the  specification 
of  concurrent  processes  was  the  idea  of  coroutines. 
Broadly  speaking  a  coroutine  can  be  thought  of  as  a 
process  implemented  as  a  subroutine;  the  distinction 
is  that  subroutines  are  initiated  by  a  call  and 
terminated  by  a  return  in  a  strictly  hierarchical  way. 
Coroutines  are  initiated  and  terminated  by  a  resume 
statement  in  a  non-hierarchical  fashion;  except,  of 
course,  when  they  are  initiated  and  terminated  in  a 
hierarchical  fashion  (so  that  they  are  actually 
subroutines).  The  crucial  points  to  notice  are  that: 

•  The  resume  command  serves  to 
synchronize  the  processes  in  the  separate 
coroutines  so  that  all  the  processes 
implemented  as  coroutines  can  execute 
on  a  single  processor  with  no  loss  in 
execution  time  (except  for  overhead). 


•  Only  one  coroutine  can  be  executing  at  a 
time  so  that  the  switching  between 
processes  that  can  occur  in  a 
multiprogramming  environment  is 
completely  determined  by  the 
programmer. 

The  only  process  flow  graphs  for  concurrent 
programs  that  can  be  implemented  by  coroutines  are 
linear. 
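
As a rough modern analogue, Python generators behave like coroutines: each runs until it suspends itself, and only one is executing at any moment. The sketch below is an assumption-laden illustration (the names producer and consumer are invented here), and Python generators always hand control back to whoever resumed them, so they are somewhat more restricted than the general resume statement described above.

```python
def producer(items):
    """Coroutine that hands over one item each time it is resumed."""
    for item in items:
        yield item            # suspend here until resumed again

def consumer(log):
    """Coroutine that processes one item each time it is resumed."""
    while True:
        item = yield          # suspend until an item is sent in
        log.append(item * item)

log = []
cons = consumer(log)
next(cons)                    # run the consumer up to its first suspension
for item in producer([1, 2, 3]):   # each loop step resumes the producer
    cons.send(item)           # resume the consumer with the new item
cons.close()
print(log)   # [1, 4, 9]
```

The switching between the two routines is fixed entirely by where the program suspends and resumes them, which is the point made in the second bullet above.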


2.2.  Fork  and  Join 

Another  early  notion  for  the  specification  of 
concurrent  processes  was  the  fork  statement.  The 
statement specifies that the invoked routine should
execute in parallel with the invoking routine. The
invoking routine can execute a join statement to
force synchronization with the completion of the
invoked routine. The fork-join mechanism is a very
powerful tool for implementing parallel processing
and is a feature of the Unix operating system; its
only major flaw is that definition of a process is
directly  connected  to  its  synchronization.  An 
arbitrary  process  flow  graph  can  be  implemented 
using  fork  and  join. 
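
The fork and join pattern can be sketched with Python threads. This is only an analogy under stated assumptions: a thread stands in for the forked routine (the Unix fork call creates a separate process instead), and the routine name and the results dictionary are hypothetical.

```python
import threading

results = {}

def invoked_routine(n):
    """Runs in parallel with the invoking routine."""
    results["sum"] = sum(range(n + 1))

# "fork": start the invoked routine in its own thread
child = threading.Thread(target=invoked_routine, args=(100,))
child.start()

# the invoking routine continues with its own work
results["product"] = 6 * 7

# "join": wait for the invoked routine to complete
child.join()
print(results["sum"], results["product"])   # 5050 42
```

Note that creating the thread and waiting for it are separate statements, so a join can be placed anywhere after the fork; this freedom is what allows arbitrary process flow graphs.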


2.3.  Cobegin  and  Coend 

The  statements  cobegin  and  coend  delineate 
blocks  of  statements  that  can  be  executed 
concurrently.  The  essential  notion  is  that  all  blocks 
within  the  scope  of  the  cobegin-coend  begin  at  the 
same  time  and  the  cobegin-coend  terminates  only 
when  all  blocks  within  its  scope  have  terminated. 
One  distinct  advantage  of  the  cobegin-coend 
structure  is  that  there  is  only  one  path  in  and  one 
path  out  of  the  construct;  another  advantage  is  the 
explicit  specification  of  which  processes  are  being 
executed  concurrently.  All  process  flow  graphs  that 
are  series-parallel  can  be  implemented  by  cobegin- 
coend;  however,  the  implementation  of  arbitrary 
graphs  by  cobegin-coend  requires  the  introduction  of 
extra  null  processes. 
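
A cobegin-coend block can be imitated with a small helper, sketched here in Python. The helper name cobegin is an assumption for illustration; it submits all blocks at once (so they start at roughly, not exactly, the same time) and returns only when every block has terminated.

```python
from concurrent.futures import ThreadPoolExecutor

def cobegin(*blocks):
    """Run every block concurrently; return (coend) only when all have finished."""
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        futures = [pool.submit(block) for block in blocks]
        return [f.result() for f in futures]   # results in the order given

# Series-parallel example: two blocks in parallel, then a sequential step.
partial = cobegin(lambda: sum(range(0, 50)), lambda: sum(range(50, 100)))
total = sum(partial)
print(partial, total)   # [1225, 3725] 4950
```

The single call site gives the one path in and one path out mentioned above, and the argument list states explicitly which blocks run concurrently.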


3.  SYNCHRONIZATION 

While  there  is  only  one  fundamental  reason  for 
being  concerned  with  the  synchronization  of 
concurrent  processes  (to  control  the 
cooperation/interference  of  one  process  with 
another),  there  are  two  distinct  types  of 
synchronization  which  can  be  useful.  The  first  of 
these  types  occurs  when  the  values  of  some  shared 
variables  are  not  "correct";  that  is,  one  process 
should  not  use  the  variables  until  some  other 
process  has  changed  their  value  during  its  (the 
second  process's)  execution.  The  second  of  these 
types  occurs  when  some  subset  of  statements  in  a 
process  must  be  treated  as  an  indivisible  operation. 
The  dividing  line  between  these  two  types  of 
synchronization  is  not  always  sharp,  as  can  be  seen 
by  considering  the  following  simple  example. 
Suppose  a  variable  X  has  the  value  1.  Consider  a 
concurrent  program  which  consists  of  the  following 
two assignment statements to be executed in
parallel:

    2 * X -> X

    3 * X -> X

The two statements can be executed in either order.
The end result of this concurrent program will be
correct provided that one of the statements does
not use the shared variable X until the other
statement has set its value; equivalently, the result
will be correct provided each statement is treated
as  an  indivisible  unit.  These  two  types  of 
synchronization  can  be  implemented  in  several 
distinct  ways  which  are  discussed  in  the  following 
sections. 
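
The example can be checked by brute force. The Python sketch below assumes, for illustration, that each assignment is carried out as two separate steps (read X, then write the new value), and it enumerates every interleaving of those steps; the labels A and B for the two statements are hypothetical.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving of the read and write steps of both statements."""
    x = 1                      # shared variable X starts at 1
    local = {}                 # private copy of X read by each statement
    factor = {"A": 2, "B": 3}  # A is 2 * X -> X, B is 3 * X -> X
    for stmt, step in schedule:
        if step == "read":
            local[stmt] = x
        else:                  # "write"
            x = factor[stmt] * local[stmt]
    return x

steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

def valid(schedule):
    """Each statement must read before it writes."""
    return all(schedule.index((s, "read")) < schedule.index((s, "write"))
               for s in "AB")

# All interleavings, and then only those that keep each statement indivisible.
outcomes = sorted({run(p) for p in permutations(steps) if valid(p)})
atomic = sorted({run(p) for p in (steps, steps[2:] + steps[:2])})
print(outcomes, atomic)   # [2, 3, 6] [6]
```

Only the two schedules that treat each statement as an indivisible unit give the correct value 6; the others lose one of the updates.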


3.1.  Shared  Variables 

Perhaps  the  simplest  way  to  synchronize  two 
processes  is  to  have  one  of  the  processes  set  a 
variable  when  a  particular  condition  is  satisfied  and 
to  have  the  other  process  test  the  variable  until  its 
value  indicates  that  the  condition  is  satisfied.  One 
obvious  drawback  to  this  technique  is  that  the 
process  which  is  waiting  will  be  "spinning  its 
wheels"  testing  the  shared  variable;  this  is  an 
obvious  waste  of  CPU  time.  A  less  obvious 
drawback  is  that  the  programmer  is  compelled  to 
understand  the  synchronization  that  is  necessary  and 
to  explicitly  program  it.  The  most  serious  drawback 
is the possibility of a deadlock; a deadlock occurs
when two (or more) processes are waiting for
events which can never occur. As a simple example
consider the concurrent program listed as Program 3-1.

These  programs  implement  two  concurrent 
processes  as  the  subroutines  named  ONE  and  TWO. 
The  synchronization  is  implemented  through  the  two 
shared  variables  named  START1  and  START2;  the 
variable  STARTi  has  the  value  TRUE  when  process 
numbered  i  is  in  the  part  of  its  program  which 
directly  influences  the  process  numbered  3-i.  This 
part  is  generally  referred  to  as  the  critical  section. 
The  subroutines  WORK1  and  WORK2  which  are  not 
given  should  contain  the  actual  critical  sections  of 
the  two  processes.  The  important  point  to  notice  is 
that  both  processes  could  enter  the  loops  at  about 
the  same  time  with  the  result  that  neither  one  of 
them  could  continue  doing  anything  productive. 
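
The deadlock can be reproduced with a small simulation. The Python sketch below is not the Fortran of Program 3-1; it is a hypothetical model which assumes that the two processes advance in perfect lock step, so that both set their flags before either one tests the other's flag.

```python
def lockstep_program_3_1(max_rounds=1000):
    """Simulate subroutines ONE and TWO of Program 3-1 advancing in lock step."""
    start = {1: False, 2: False}
    worked = {1: False, 2: False}

    # Both processes set their own flag before either tests the other's.
    start[1] = True
    start[2] = True

    rounds = 0
    while (start[1] or start[2]) and rounds != max_rounds:
        # Both processes test the other's flag at the same instant ...
        may_enter = {i: start[i] and not start[3 - i] for i in (1, 2)}
        # ... and only then act on what they saw.
        for i in (1, 2):
            if may_enter[i]:
                worked[i] = True      # stands in for "call work<i>"
                start[i] = False
        rounds += 1
    return worked, rounds

worked, rounds = lockstep_program_3_1()
print(worked, rounds)   # {1: False, 2: False} 1000
```

Each process keeps seeing the other's flag set, so both loop until the simulation gives up; neither ever reaches its critical section.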

A  fairly  simple  and  straightforward  modification 
of  Program  3-1  alleviates  most  of  the  possibilities 
for  a  deadlock  and  is  given  in  Program  3-2  below. 
If  the  two  processes  executing  subroutines  ONE  and 
TWO  run  at  exactly  the  same  speed  and  start  at  the 
same  time  then  this  program  will  never  actually 
execute  the  subroutines  WORK1  and  WORK2. 

Peterson  (1981)  introduced  a  third  shared  variable 
into  this  protocol;  the  third  variable  guarantees  that 
both processes will eventually get to execute their
critical sections and is implemented in Program 3-3
below.
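
For readers who want to experiment, Peterson's protocol can be sketched with two Python threads. Several things here are assumptions: the variable names are invented, the busy wait yields with time.sleep(0), and the sketch relies on the standard CPython interpreter executing the shared reads and writes in program order. It is an illustration, not a recommended way to obtain mutual exclusion in Python.

```python
import threading
import time

start = [False, False]   # start[i] is True while process i wants the critical section
turn = [0]               # the third shared variable: which process must wait
in_critical = [0]        # number of processes currently inside (should never exceed 1)
max_seen = [0]
counter = [0]

def process(i, repeats):
    other = 1 - i
    for _ in range(repeats):
        start[i] = True
        turn[0] = other                       # give priority to the other process
        while start[other] and turn[0] == other:
            time.sleep(0)                     # busy wait, yielding the interpreter
        # --- critical section ---
        in_critical[0] += 1
        max_seen[0] = max(max_seen[0], in_critical[0])
        counter[0] += 1
        in_critical[0] -= 1
        # --- end of critical section ---
        start[i] = False

threads = [threading.Thread(target=process, args=(i, 200)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0], max_seen[0])
```

If the protocol works as intended, every one of the 400 passes through the critical section is counted and at most one process is ever observed inside it.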

3.2.  Semaphores 

A better and more detailed notion for
implementing the mutual exclusion above using
shared variables is to use a semaphore. A
semaphore is a non-negative integer; there are two
operations defined on a semaphore:

•  Signal(s), which executes the assignment
   s + 1 -> s, and

•  Wait(s), which delays execution until s is
   positive and then sets s - 1 -> s.

Using  a  semaphore,  the  concurrent  program  with 
critical  sections  given  in  the  previous  section  can  be 
rewritten  as  in  Program  3-4  below. 
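
The same idea can be tried out with the semaphore in Python's standard threading module, where acquire plays the role of Wait(s) and release plays the role of Signal(s). The sketch is an illustration rather than a translation of Program 3-4; the shared counter named balance is a hypothetical stand-in for whatever the critical sections protect.

```python
import threading

s = threading.Semaphore(1)    # semaphore initialised to 1
balance = [0]                 # shared data touched only inside the critical section

def process(repeats):
    for _ in range(repeats):
        s.acquire()           # Wait(s): block until s is positive, then s - 1 -> s
        current = balance[0]  # critical section: read-modify-write of shared data
        balance[0] = current + 1
        s.release()           # Signal(s): s + 1 -> s

threads = [threading.Thread(target=process, args=(2000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance[0])   # 4000
```

Unlike the shared-variable protocols, a waiting thread here is suspended by the library rather than spinning on a test, and no update is lost.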


Program  3-1:  Mutual  Exclusion  Protocol 
with  Possible  Deadlock 


      block data
      common start1,start2
      logical start1,start2
      data start1/.false./,start2/.false./
      end

      subroutine one
      common start1,start2
      logical start1,start2
c     work that does not need to be synchronized
c     with the other process can be done here
      start1=.true.
      do while (start1)
         if (.not. start2) then
            call work1
            start1=.false.
         endif
      enddo
c     work that does not need to be synchronized
c     with the other process can be done here
      end

      subroutine two
      common start1,start2
      logical start1,start2
c     work that does not need to be synchronized
c     with the other process can be done here
      start2=.true.
      do while (start2)
         if (.not. start1) then
            call work2
            start2=.false.
         endif
      enddo
c     work that does not need to be synchronized
c     with the other process can be done here
      end

Program  3-2:  Modified  Mutual  Exclusion  Protocol 

      block data
      common start1,start2
      logical start1,start2
      data start1/.false./,start2/.false./
      end

      subroutine one
      logical start1,start2
      common start1,start2
      do while (.true.)
c     work that does not need to be synchronized
c     with the other process can be done here
         start1=.true.
         if (.not. start2) then
            call work1
         endif
         start1=.false.
c     work that does not need to be synchronized
c     with the other process can be done here
      enddo
      end

      subroutine two
      common start1,start2
      logical start1,start2
      do while (.true.)
c     work that does not need to be synchronized
c     with the other process can be done here
         start2=.true.
         if (.not. start1) then
            call work2
         endif
         start2=.false.
c     work that does not need to be synchronized
c     with the other process can be done here
      enddo
      end

Program  3-3:  Peterson's  Mutual  Exclusion  Protocol 

      block data
      common start1,start2,process
      logical start1,start2
      integer process
      data start1/.false./,start2/.false./
      data process/1/
      end

      subroutine one
      common start1,start2,process
      logical start1,start2
      integer process
      do while (.true.)
c     work that does not need to be synchronized
c     with the other process can be done here
         start1=.true.
         process=2
c     wait while process two wants in and has priority
         do while (start2 .and. process.eq.2)
         enddo
         call work1
         start1=.false.
c     work that does not need to be synchronized
c     with the other process can be done here
      enddo
      end

      subroutine two
      common start1,start2,process
      logical start1,start2
      integer process
      do while (.true.)
c     work that does not need to be synchronized
c     with the other process can be done here
         start2=.true.
         process=1
c     wait while process one wants in and has priority
         do while (start1 .and. process.eq.1)
         enddo
         call work2
         start2=.false.
c     work that does not need to be synchronized
c     with the other process can be done here
      enddo
      end




Program  3-4:  Mutual  Exclusion  with  Semaphores 


      block data
      common s
      integer s
      data s/1/
      end

      subroutine one
      common s
      integer s
      do while (.true.)
c     work that does not need to be synchronized
c     with the other process can be done here
         call wait(s)
         call work1
         call signal(s)
c     work that does not need to be synchronized
c     with the other process can be done here
      enddo
      end

      subroutine two
      common s
      integer s
      do while (.true.)
c     work that does not need to be synchronized
c     with the other process can be done here
         call wait(s)
         call work2
         call signal(s)
c     work that does not need to be synchronized
c     with the other process can be done here
      enddo
      end

      subroutine wait(s)
      integer s
c     the test and the decrement must together be indivisible
      do while (s.le.0)
      enddo
      s = s - 1
      end

      subroutine signal(s)
      integer s
c     as defined in the text: s + 1 -> s
      s = s + 1
      end


3.3.  Data-flow  Synchronization 

A  data-flow  algorithm,  in  the  sense  of  Treleaven, 
Brownbridge,  and  Hopkins  (1982),  is  a  collection  of 
statements  together  with  a  directed  graph  which 
represents  the  flow  of  data  among  the  statements. 
We  find  the  notion  of  a  data-flow  algorithm,  as  a 
means  for  implementing  a  concurrent  process, 
compelling,  for  a  variety  of  reasons.  First,  it  is 
possible  to  implement  a  data-flow  system  on  any 
network  of  processors  which  supports  some  simple 
communication  primitives;  nearly  every  computer 
installation  has  a  network  of  processors  so  this 
approach  to  parallel  processing  has  very  broad 
applicability.  Special  hardware  is  required  only  for 
those tasks which have a communication/computation
ratio which cannot be supported by standard
hardware. Second, programming a data-flow system
is  simple;  it  is  not  necessary  to  have  any  special- 
purpose  languages  nor  any  special  understanding  in 
order  to  program  such  a  system.  A  description  of 
a  primitive  data-flow  system  developed  for  a 
network  of  VAXes  is  given  in  Eddy  and  Schervish 
(1986).  Third,  a  programmer  need  not  worry  about 
the  complex  synchronization  issues  that  are  usually 
attendant  to  parallel  programs;  this  separates  the 
task  of  scheduling  computations  from  the  task  of 
programming them. The critical notion in any
data-flow system is the granularity of the problems that
are appropriate for it, that is, the size of the
computational tasks which will be treated as
indivisible units. For example, O'Leary and Stewart
(1985) discuss a special-purpose data-flow system
known as the ZMOB. One interesting and unusual
feature  of  data-flow  systems  is  that  the  processor 
speeds  need  not  be  identical,  and  they  can  in  fact 
be  stochastic.  Additionally,  the  communications 
network  used  for  interprocessor  communication  can 
have  a  bandwidth  and  latency  which  are  stochastic 
also. 
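
The firing rule behind a data-flow algorithm can be illustrated with a toy interpreter. The Python sketch below is not the system of Eddy and Schervish (1986); the statement table, the names total, diff and prod, and the function run_dataflow are all assumptions made for illustration. A statement fires as soon as every value it consumes has arrived, and nothing else controls the order.

```python
import operator

# Hypothetical data-flow program for (a + b) * (a - b).
# Each statement names the function it applies and the values it consumes.
statements = {
    "total": (operator.add, ("a", "b")),
    "diff":  (operator.sub, ("a", "b")),
    "prod":  (operator.mul, ("total", "diff")),
}

def run_dataflow(statements, inputs):
    """Fire each statement as soon as all of its input values have arrived."""
    values = dict(inputs)
    pending = dict(statements)
    firing_order = []
    while pending:
        ready = [name for name, (_, args) in pending.items()
                 if all(arg in values for arg in args)]
        if not ready:
            raise ValueError("some statement can never receive its inputs")
        # The statements in `ready` do not depend on one another, so a real
        # system could send each one to a different processor.
        results = {name: pending[name][0](*(values[a] for a in pending[name][1]))
                   for name in ready}
        values.update(results)
        for name in ready:
            del pending[name]
        firing_order.append(sorted(ready))
    return values, firing_order

values, firing_order = run_dataflow(statements, {"a": 7, "b": 3})
print(values["prod"], firing_order)   # 40 [['diff', 'total'], ['prod']]
```

The programmer writes only the statements and their data dependencies; the scheduling falls out of the arrival of data, which is the separation of concerns described above.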

The  essential  feature  of  any  problem  that  can  be 
run  efficiently  on  a  data-flow  system  is  that  the 
interprocess  communication  cost  must  be  a  small 
fraction  of  the  process  computation  cost.  We 
believe  that  any  problem  which  is  amenable  to 
parallel  computation  can  be  implemented  in  an 
efficient  manner  on  a  data-flow  system;  the  only 
dilemma  is  what  are  the  correct  size  grains 
(indivisible  computational  tasks)  for  the  particular 
application. 

One  interesting  research  problem  is  the  use  of  a 
data-flow  system  for  the  numerical  computation  of 
high-dimensional  integrals.  Briefly,  the  problem  is 
to  decide  how  to  decompose  the  integral  across  the 
processors,  for  example: 

1.  Should iterated integrals be performed
recursively with different levels of the
recursion calculated on different
processors?

2.  Is  it  better  to  use  Gaussian  quadrature  or 
adaptive  Newton-Cotes  techniques  on  a 
data-flow  system? 

3.  How  do  Monte  Carlo  techniques  compare 
with  iterated  Gaussian  techniques  or  with 
composite  Gaussian  techniques? 
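
One possible decomposition, offered only as a sketch and not as an answer to the questions above, is to cut the outermost variable into slabs and treat each slab as one indivisible task. The Python code below assumes a two-dimensional integrand on the unit square, a simple midpoint rule in place of Gaussian quadrature, and a thread pool standing in for the network of processors; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def midpoints(lo, hi, n):
    """Midpoints of n equal subintervals of [lo, hi]."""
    h = (hi - lo) / n
    return [lo + (i + 0.5) * h for i in range(n)]

def inner_integral(f, x, n):
    """Midpoint-rule approximation of the integral of f(x, y) over y in [0, 1]."""
    return sum(f(x, y) for y in midpoints(0.0, 1.0, n)) / n

def slab(f, lo, hi, n_outer, n_inner):
    """One task (grain): the iterated integral over x in [lo, hi], y in [0, 1]."""
    h = (hi - lo) / n_outer
    return h * sum(inner_integral(f, x, n_inner) for x in midpoints(lo, hi, n_outer))

def integrate(f, workers=4, n_outer=25, n_inner=100):
    """Split [0, 1] in x into one slab per worker and add the partial results."""
    edges = [i / workers for i in range(workers + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda k: slab(f, edges[k], edges[k + 1], n_outer, n_inner),
                         range(workers))
        return sum(parts)

approx = integrate(lambda x, y: x * y)   # the exact value is 1/4
print(approx)
```

Each slab returns a single number, so the communication cost per task is tiny compared with its computation cost; changing `workers`, `n_outer` and `n_inner` changes the grain size.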

A  second  interesting  problem  is  the  optimization 
of  high-dimensional  functions  on  a  data-flow 
system.  Some  work  on  this  problem  is  reported  by 
Schnabel  (1986).  Again,  briefly,  the  basic  question 
is  how  to  decompose  the  computation,  for  example: 

1.  In  a  global  optimization  problem,  how 
often,  relative  to  function  evaluations, 
should  the  separate  processors  exchange 
information? 

2.  Should  the  separate  processors  only 
evaluate  the  function  or  should  the 
decision  making  be  decentralized  also? 

3.  Should  the  computation  of  function 
values  be  localized  so  that  individual 
processors  "learn"  about  the  local 
behavior  of  the  objective  function? 

Another interesting class of research problems is
the stochastic modeling of the data-flow process;
a data-flow system can be thought of as a
multiprocessor with stochastic service times. A typical
problem might be to minimize the makespan (the
expected value of the completion time) if
preemptive scheduling is not allowed and there is a
cost associated with assigning tasks to processors.
Related problems have been attacked by Bruno,
Downey, and Frederickson (1981) and others.

4.  MESSAGE  PASSING 


4.1.  Communication  Channels 

A  communications  channel,  in  an  abstract  sense, 
is  the  specification  of  a  source  and  destination  for 
messages.  Such  channels  are  usually  implemented 
either by direct naming, where the process names
serve as source and destination, or by mailboxes,
where the mailbox name is global and can be used
by several source (or destination) processes
simultaneously.  The  use  of  communication  channels 
affects  the  interaction  of  concurrent  processes  in 
another  way.  The  processes  may  perform  their 
communication  so  that  the  output  of  each  process 
becomes  the  input  of  another  process;  such 
communication  is  usually  called  a  pipeline. 
Pipelines  are  a  feature  of  the  Unix  operating  system. 

One  common  alternative  mode  of  process 
interaction  is  a  client/server  model.  A  client 
process  requests  some  particular  service  from  a 
server  process;  upon  completion  the  server  process 
sends  a  completion  message  to  the  client  process. 
The  client/server  communications  relationship  is 
usually  implemented  with  a  mailbox.  A  client 
process  can  send  a  message  to  any  possible  server 
process  and  a  server  process  can  receive  a  message 
from  any  possible  client  process. 
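
A client/server exchange through a mailbox can be sketched with Python's standard queue module. The details are assumptions chosen for illustration: the mailbox named requests, the private reply mailbox that each client encloses with its request, the squaring "service", and the None message used to shut the server down.

```python
import queue
import threading

requests = queue.Queue()          # global mailbox: any client may write to it

def server():
    """Receive a request, perform the service, send a completion message."""
    while True:
        message = requests.get()              # wait for the next request
        if message is None:                   # hypothetical shutdown message
            break
        reply_box, number = message
        reply_box.put(number * number)        # completion message to that client

def client(number, answers):
    reply_box = queue.Queue()                 # private mailbox for the reply
    requests.put((reply_box, number))         # request for service
    answers[number] = reply_box.get()         # wait for the completion message

server_thread = threading.Thread(target=server)
server_thread.start()

answers = {}
clients = [threading.Thread(target=client, args=(n, answers)) for n in (2, 3, 4)]
for c in clients:
    c.start()
for c in clients:
    c.join()

requests.put(None)                # tell the server to stop
server_thread.join()
print(sorted(answers.items()))   # [(2, 4), (3, 9), (4, 16)]
```

Because the request mailbox is global, the server needs no advance knowledge of which clients exist, and further servers could read from the same mailbox.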


4.2.  Synchronous/Asynchronous  Communication 

A critical feature of the interprocess
communication is whether the communications
protocol  is  synchronized  or  not.  An  additional 
complication  is  that  the  synchrony  can  be  different 
at  each  end  of  the  communications  channel.  In 
particular,  each  time  an  actual  read  or  write  is 
issued  by  a  process  it  can,  in  principle,  be  executed 
with  an  implied  "wait  until  completion"  or  not. 
Consider  as  an  example  a  dedicated  server  process. 
Presumably,  the  server  process  begins  by  issuing  a 
read  with  a  wait  until  completion  since  the  process 
does  nothing  until  it  receives  a  request  from  a 
client.  The  client  process  on  the  other  hand 
presumably  issues  the  request  for  service  as  a  write 
with  no  wait  if  the  service  is  not  time  critical  or  as 
a  write  with  a  wait  if  the  client  is  unable  to 
continue  until  the  service  is  completed.  When  the 
server process has finished its work it issues a
write (back to the client) without a wait and then
reissues the read with a wait. The client process
issues a read from the server process with or
without a wait depending on the exact nature of the
service provided. The point here is that whether or
not the communication is synchronous
depends on the application.

In  the  case  of  asynchronous  writes,  one  (often 
unanticipated)  problem  is  that  a  large  number  of 
writes  may  be  issued  without  a  corresponding 
number  of  reads.  As  a  consequence  the  receiving 
process must have available an essentially unlimited
amount of buffer space to store the messages until
the  receiving  process  is  ready  to  read  them.  A  sort 
of  intermediate  protocol  between  totally 
synchronous  and  totally  asynchronous  communication 
is  to  have  some  fixed  limited  amount  of  buffer 
space;  this  allows  the  sending  process  to  operate 
asynchronously  unless  it  gets  too  far  ahead  of  the 
receiving  process.  In  this  case  the  communications 
protocol  should  not  let  the  asynchronous  write  be 
executed  because  of  a  lack  of  buffer  space. 
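
The intermediate protocol can be demonstrated with a bounded queue from Python's standard library. The buffer size of three and the message values are arbitrary assumptions; the point is only that an asynchronous write is refused once the buffer is full and succeeds again after the receiver catches up.

```python
import queue

channel = queue.Queue(maxsize=3)   # fixed, limited amount of buffer space

accepted, refused = [], []
for message in range(5):           # the sender runs ahead; nothing is read yet
    try:
        channel.put_nowait(message)    # asynchronous write: do not wait
        accepted.append(message)
    except queue.Full:                 # no buffer space: the write is not executed
        refused.append(message)

first = channel.get()              # the receiver finally reads one message ...
channel.put_nowait(99)             # ... which frees space for another write
print(accepted, refused, first, channel.qsize())   # [0, 1, 2] [3, 4] 0 3
```

A sender that would rather be delayed than refused can call `channel.put(message)` instead, which waits until space is available and so behaves synchronously only when the sender is too far ahead.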


5.  SEQUENTIAL  PROCESSORS 

The  essential  features  of  the  "von  Neumann" 
model  of  a  stored  program  digital  computer  are: 

1.  The  data  is  represented  by  digits  in  some 
number  system. 

2.  The  program  and  data  both  reside  in  the 
same  memory. 

3.  Instructions  are  executed  one  after 

another. 

In  the  last  three  decades  the  speed  of  serial 
computers  has  improved  enormously.  While  a  large 
part  of  this  improvement  has  stemmed  from 
improved  technology  at  least  some  of  this 
improvement  has  come  from  the  partial  introduction 
of  parallel  processing.  In  particular: 

1.  Separate  processors  now  handle  the  input 
and  output;  this  was  previously  a  major 
burden  on  the  CPU  because  of  the 
relative  slowness  of  external  I/O  devices. 

2.  Execution  of  instructions  is  somewhat 
overlapped  both  because  of  the  existence 
of  special  processors  for  certain 
instructions and also because the
individual instructions are decomposed
into (and processed in) separate parts.
This decomposition is referred to as
pipelining and is discussed in more
detail in Section 6 below.

3.  Memory  design  now  allows  essentially 
simultaneous  access  to  consecutive 
storage  elements  without  interference 
(interleaving). 

Flynn  (1966)  introduced  a  nomenclature  for 
models  of  computation  which  has  become,  although 
imprecise,  the  standard.  His  scheme  classifies 
machines  on  the  basis  of  two  attributes: 

1.  whether  the  machine  can  process  more 
than  one  instruction  simultaneously; 

2.  whether  the  machine  can  process  more 
than  one  data  item  simultaneously. 

The  resulting  four  types  of  machines  are: 

1.  SISD  -  single  instruction,  single  data; 

2.  MISD  -  multiple  instruction,  single  data; 

3.  MIMD  -  multiple  instruction,  multiple 
data. 

4.  SIMD  -  single  instruction,  multiple  data. 

It  will  be  helpful  in  the  discussion  that  follows 
to  describe  various  computer  models  by  analogy  to 
a  fast  food  restaurant;  the  customers  (or, 
equivalently,  their  orders)  represent  instructions,  the 
items  ordered  represent  the  data  being  processed, 
the  servers  represent  processors  and  the  Assistant 
Manager,  if  present,  represents  the  control  unit. 

The  von  Neumann  model  describes  the  SISD 
machines.  In  the  fast  food  analogy  there  is  a 
single  queue  of  customers  and  a  single  server  under 
direction  of  the  Assistant  Manager;  each  customer's 
order  is  filled  before  the  next  customer's  order  is 
taken. 

It  is  generally  agreed  that  MISD  machines  do  not 
exist  although  some  authors  put  pipeline  machines  in 
this  category. 

The  MIMD  machines  are  generally  very 
specialized,  even  unique.  Two  early  examples  are 
the  C.mmp  and  Cm*  machines  built  at  Carnegie- 
Mellon  University.  Cm*  consisted  of  50  processors, 
each  with  its  own  memory,  in  five  clusters  of  ten 
processors  each.  For  example,  memory  references 
could refer to memory attached to the same
processor, memory attached to a different processor
in the same cluster, or memory attached to a
different processor in a different cluster. The times
required for access to data items in these three
cases are in the ratio of 1:3:9. Obviously then,
efficient  use  of  Cm*  requires  programs  to  have  a 
certain  "locality"  in  addition  to  their  parallel  nature. 
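
The effect of locality is easy to quantify from the 1:3:9 ratio. The short Python calculation below uses two hypothetical reference mixes that are not taken from any measurement: one with good locality, and one in which references are spread uniformly over the 50 memories (1 local, 9 in the same cluster, 40 in other clusters).

```python
import math

# Relative access times quoted in the text for Cm*: same processor,
# same cluster, different cluster.
COST = {"local": 1, "cluster": 3, "remote": 9}

def mean_access_cost(fractions):
    """Average relative cost of a memory reference for a given reference mix."""
    if not math.isclose(sum(fractions.values()), 1.0):
        raise ValueError("fractions must sum to 1")
    return sum(COST[kind] * share for kind, share in fractions.items())

# Hypothetical mixes, chosen only to illustrate the arithmetic.
good_locality = mean_access_cost({"local": 0.90, "cluster": 0.08, "remote": 0.02})
no_locality = mean_access_cost({"local": 1 / 50, "cluster": 9 / 50, "remote": 40 / 50})
print(round(good_locality, 2), round(no_locality, 2))   # 1.32 7.76
```

Under these assumed mixes the program without locality pays nearly six times as much per memory reference.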

An  early  commercial  MIMD  machine  was  the 
Denelcor Heterogeneous Element Processor (HEP).
The HEP could have up to 16 processors attached to
its memory bus; unfortunately only a very small
number of these machines were built. See Kowalik
(1985) for considerably more detail. The fast food
analogy for an MIMD machine involves a chain of
fast food restaurants. If one restaurant runs out of,
for example, strawberry milkshakes, it may obtain
more  from  another  restaurant  in  the  chain  or  it  may 
send  customers  to  another  restaurant  in  the  chain. 

A  different  example  of  MIMD  architecture  which 
will  have  growing  importance  is  discussed  in 
Section  8  below. 

Turning  now  to  the  most  important  of  Flynn's 
categories,  there  are  several  major  types  of  SIMD 
machines: 

1.  array  processors 

2.  associative  processors 

3.  data  flow  processors 

4.  pipeline processors

5.  systolic processors

In the next sections some of these will be
described briefly.


6.  VECTOR PROCESSORS

Vector processors are having a major impact on
the world of scientific computing because of their
raw speed and their ability to perform as
general-purpose machines. This means that larger problems
can be solved more rapidly and users do not have
to learn new concepts to use the systems. There
are a number of such machines, all having slightly
varying hardware architecture, such as the Cray 1S,
the CDC Cyber 205, and the two Japanese machines,
the Fujitsu VP-200 and the Hitachi S810. These
machines achieve their phenomenal speeds through a
number of unusual architectural features together
with the technology used to implement the system.
The single most important feature is that the basic
machine cycle time is on the order of 10
nanoseconds; this is roughly 1000 times faster than
a typical personal computer.

The design of these machines is optimized for
the processing of arrays because most large-scale
scientific calculations are based on linear algebraic
operations. The critical architectural features include
pipelining of instructions, parallel functional units
and the use of vector operations. A functional unit
is a specialized part of the arithmetic/logical unit of
the CPU which implements some specific portion of
the instruction set and operates totally
independently of the other units. Although a
functional unit may require more than one clock
period to complete its calculation, new pairs of
operands may enter each unit during each clock
period. This is because data is moved into a new
set of registers (within the unit) at the end of each
clock period. This is the notion of pipelining.

The use of vector operations is implemented by
way of a set of special functional units and a set
of special instructions which are executed by those
units.


7.  ATTACHED PROCESSORS

An intermediate approach to gaining high
performance, between brute-force processor
speed-up and large numbers of processors, is to use
special-purpose hardware which does not have the
full capabilities of a general purpose computer.
Generally, these special-purpose processors are
attached to some general-purpose processor. The
attached processor appears to be some sort of I/O
device to the general machine; the attached
processor performs its I/O through the standard
facilities of the general machine.

There are a surprising variety of attached
processors with quite diverse characteristics. A few
that should be of particular interest to statisticians
are the Star Technologies ST-100, the FPS 164 and
264, the CSPI Mini-Map 211, the Mercury Zip, the
Analogic AP400 and AP500 and the Skye Warrior.
These machines range in price from about $5000 to
$500,000; their speeds and capabilities cover an
equally broad range. Generally speaking, attached
processors are array processors in the sense that
their architecture is optimized for array operations;
they achieve speeds approaching that of the vector
processors described in Section 6 above at a
fraction of the cost.


8.  NETWORKS  OF  PROCESSORS 

From  our  point  of  view,  networks  of  similar 
processors  provide  the  most  exciting  prospect  for 
parallel  computation.  The  reason  is  simply  that  we 
believe  that  the  data-flow  approach  described  in 
Section  3.3  above  is  the  easiest  method  of 
concurrent  programming  currently  available  and  is 
simultaneously  the  only  existing  method  which  will 
scale  up  to  very  large  numbers  of  processors.  By 
the  words  "scale  up"  here,  we  mean  simply  that 
whatever  works  now  for  four  or  forty  processors 
will  ultimately  work  for  400  or  40,000  processors. 
The crucial detail in any system based on a network
of processors is what sort of hardware
intercommunications  network  exists  among  the 
processors. 

Obviously,  the  most  desirable  situation  occurs 
when  every  one  of  the  processors  is  able  to 
communicate  directly  with  every  other  processor.  In 
this  case,  the  interprocessor  communications 
network  is  a  complete  graph.  Obviously,  and 
unfortunately,  such  a  network  will  not  scale  up; 
simple  geometrical  considerations  show  that  the 
number  of  hardware  channels  on  a  single  processor 
cannot  increase  without  bound  as  the  number  of 
processors  does  increase.  So  a  system  with  n 
processors  and  n  hardware  channels  per  processor 
does  not  scale  up. 

The  entire  game  then  is  to  invent  interprocessor 
communication  graphs  which  have  short  paths  (where 
each  channel  counts  as  length  1)  between  any  two 
processors  and  a  number  of  communications 
channels  per  processor  which  grows  very  slowly  (if 
at  all)  as  a  function  of  the  number  of  processors  in 
the network. Currently the most popular
interconnection scheme is based on a k-dimensional
hypercube, letting the corners be
processors and the edges be communication
channels. If the hypercube has $n = 2^k$ corners then
there are $k = \log_2 n$ edges per corner (channels per
processor). At this time there are at least three
commercial hypercube systems: the Intel iPSC which
can have up to 64 nodes (i.e., a $2^6$ hypercube), the
Hypernet System 14 which can have up to 256
nodes (i.e., a $2^8$ hypercube), and the NCube Ten
which can have up to 1024 nodes (i.e., a $2^{10}$
hypercube).
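
The hypercube arithmetic can be made concrete by numbering the processors in binary, so that two processors share a channel exactly when their numbers differ in one bit. The Python sketch below is an illustration; the function names are hypothetical and the numbering convention is an assumption (it is the usual one, but it is not stated in the text).

```python
def neighbors(node, k):
    """Processors directly connected to `node` in a k-dimensional hypercube."""
    return [node ^ (2 ** bit) for bit in range(k)]   # flip one address bit

def path_length(a, b):
    """Fewest channels between a and b: the number of differing address bits."""
    return bin(a ^ b).count("1")

def channels_per_processor(n, scheme):
    """Hardware channels needed on each of n processors."""
    if scheme == "complete":
        return n - 1
    if scheme == "hypercube":
        return n.bit_length() - 1        # log2(n) when n is a power of two
    raise ValueError(scheme)

k = 6                                     # 2**6 = 64 nodes
n = 2 ** k
print(neighbors(0, 3))                                   # [1, 2, 4]
print(max(path_length(0, b) for b in range(n)))          # 6
print(channels_per_processor(n, "hypercube"),
      channels_per_processor(n, "complete"))             # 6 63
```

For 64 processors the hypercube needs 6 channels per processor and never more than 6 hops, whereas the complete graph needs 63 channels per processor to achieve a single hop.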

The  hypercube  interconnection  scheme  does  not 
scale  up  to  very  large  numbers  of  machines  and  a 
variety  of  other  schemes  are  likely  to  appear  in  the 
market  place  in  the  next  few  years. 


DISCRETE-FINITE  INFERENCE  ON  A  NETWORK  OF  VAXES 

William  F.  Eddy  and  Mark  J.  Schervish,  Carnegie-Mellon  University 


SUMMARY 

Because  all  data  are  discrete  and  because  all 
digital  computation  is  discrete,  we  investigate  the 
consequences  of  abandoning  the  implied 
approximations  involved  in  the  use  of  continuous- 
parameter  models  for  continuous  random  variables. 
In  particular,  we  study  models  for  discrete  random 
variables  specified  by  probabilities  which  can  only 
assume  a  finite  number  of  values.  If  we  allow  all 
possible  models,  the  amount  of  calculation  required 
is  really  formidable;  for  one  very  simple  example 
we  estimate  that  one  million  years  of  cpu  time  are 
required  to  determine  the  predictive  distribution  for 
a  single  future  observation.  Consequently,  effort  is 
needed  both  to  reduce  the  amount  of  calculation 
required  and  to  speed  up  the  calculation  that  must 
be  done.  Of  particular  interest  in  this  regard  is  the 
use  of  multiple  microcomputers  as  parallel 
processors.  This  breakthrough  has  two  important 
advantages.  Firstly,  it  dramatically  reduces  the 
"wall  clock  time"  required  to  perform  the  discrete- 
finite  calculations.  Secondly,  it  provides  a 
numerically  more  stable  algorithm  for  computing  the 
results. 


1.  INTRODUCTION 

We  choose  to  investigate  models  for  discrete 
data  which  assume  only  finitely  many  values  for 
several  reasons.  First,  observed  data  are  discrete; 
only  finitely  many  different  values  are  possible  in 
any  particular  situation.  Second,  all  computations 
are  routinely  performed  in  a  digital  computer;  this 
restricts  the  possible  values  at  any  stage  of 
calculation  to  be  a  discrete  finite  set. 

Third,  we  believe  that  the  use  of  such  models 
forces  researchers  to  focus  their  attention  on  the 
important  issues  in  statistical  modelling.  We  agree 
with  Geisser  (1971,  1980)  that  the  primary  purpose 
of  probabilistic  inference  is  predictive  in  nature. 
That  is,  one  models  data  statistically  because  one  is 
interested  in  making  predictions  about  other,  as  yet 
unseen,  data.  The  reason  that  discrete-finite  models 
force  the  focus  onto  predictive  inference  is  that  the 
models  are  simply  vectors  of  non-negative  numbers 
which  add  up  to  1.  We  deliberately  avoid 
expressing the models in terms of parameters which
might  be  mistaken  for  quantities  of  interest. 

Fourth,  the  availability  of  substantial  amounts  of 
computer time diminishes one of the major
drawbacks  to  discrete-finite  models.  Calculation  of 
any  function  of  the  predictive  distribution  from  a 
discrete-finite  model  involves  a  sum  over  the 
various  possible  model  vectors.  Typically,  the 
number  of  possible  models  is  very  large;  in  a  fairly 
trivial  example  described  in  Section  3  below,  the 
number of model vectors is $6 \times 10^{16}$. We have
developed  some  computational  algorithms  which, 
combined  with  very  high  speed  computation,  will 
make  some  previously  infeasible  discrete-finite 
models  computationally  tractable.  In  particular,  we 
have  made  use  of  a  local  area  network  of  various 
models  of  DEC  VAXes  in  order  to  perform  most  of 
the  computations  in  parallel,  thereby  reducing 
elapsed time. Such parallel computation also
facilitates partitioning of the numerical results into
sums which can be accumulated with greater
numerical accuracy.


2.  DISCRETE-FINITE  MODELS 

Consider inference about a single observable $X$,
which can assume one of $d$ possible values
$x_1, \ldots, x_d$. These values can be numerical or
nominal, vector or scalar; we will refer to the set
of possible values as the observation space. The
distribution of $X$ consists of a vector
$p = (p_1, \ldots, p_d)^T$, where

$$p_j = \Pr\{X = x_j\}, \tag{1}$$

and

$$p^T \mathbf{1} = 1. \tag{2}$$

Since all calculations we perform are discrete and
finite, we assume each $p_j$ is constrained to equal
one of, say, $m$ possible values, $v_1, \ldots, v_m$. For
simplicity here, we will only consider the case
where the $\{v_k\}$ lie on a grid that is equally spaced
in $[0, 1]$; that is, we suppose that

$$v_k = (k - 1)/(m - 1).$$

One need only consider that subset of the collection
of $m^d$ "possible" vectors $p$ which satisfy Equation 2.
When $p$ is specified, all inference about $X$ can be
based on Equation 1. We will refer to $p$ as the
model vector and to the set of all possible $p$
vectors as the model space.
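The size of the model space can be checked directly. Writing $p_j = k_j/(m-1)$ with non-negative integers $k_j$, Equation 2 becomes $k_1 + \cdots + k_d = m - 1$, so the number of admissible vectors is the binomial coefficient $\binom{m+d-2}{d-1}$. The following Python sketch is an illustration only (the function names are hypothetical, and it is not the Fortran code used for the paper); it compares the closed form with a brute-force scan of all $m^d$ grid points for a small case.

```python
from itertools import product
from math import comb


def count_model_vectors(d: int, m: int) -> int:
    """Number of grid vectors with d coordinates in {0, 1/(m-1), ..., 1} summing to 1.

    With p_j = k_j / (m - 1), this counts non-negative integer solutions of
    k_1 + ... + k_d = m - 1 (a stars-and-bars count).
    """
    return comb(m + d - 2, d - 1)


def enumerate_model_vectors(d: int, m: int):
    """Brute force: scan all m**d grid points and keep those satisfying Equation 2."""
    for ks in product(range(m), repeat=d):
        if sum(ks) == m - 1:
            yield ks


# Small case: the brute-force scan agrees with the closed form.
brute = sum(1 for _ in enumerate_model_vectors(3, 5))
print(brute, count_model_vectors(3, 5))

# d = 10 values on a grid of m = 10 points.
print(count_model_vectors(10, 10))
```

For $d = 10$ the closed form gives 48620, 817190 and 6906900 vectors for $m = 10$, 15 and 20, which are the counts reported in Section 3.1.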

Next, assume that one is interested in making
inference about a subset of some sequence
$X_1, X_2, \ldots$ of observables which are exchangeable, in
the sense that their labels provide no information
about their joint distribution. We assume that each
$X_i$ must equal one of the values $x_1, \ldots, x_d$. A
theorem of de Finetti (1937) (also see Hewitt and
Savage, 1955) shows that conditional on some
vector $p$ satisfying Equation 2, the $X_i$ are
independent with distribution given by Equation 1.
Once again, the model space is a finite collection
of vectors, say, $\{r_1, \ldots, r_t\}$, with
$r_s = (r_{1s}, \ldots, r_{ds})^T$.
For convenience, let the distribution of $p$ be
uniform. We use subscript $i$ to index observations
and $j$ to index possible values. Since, for each $i$,

$$\Pr\{X_i = x_j \mid p\} = p_j, \tag{3}$$

we can calculate the conditional distribution of $p$
given any finite subset $X^*$ of $X_1, X_2, \ldots$ as follows.
For $j = 1, \ldots, d$, let $n_j$ be the number of observed
$X$'s in $X^*$ equal to $x_j$, so that $(n_1, n_2, \ldots, n_d)$ has a
multinomial distribution conditional on $p$. Using
Equation 3 and the conditional independence of the
$X_i$ given $p$, we obtain

$$\Pr\{p = r_s \mid X^*\} = K^{-1} \prod_{j=1}^{d} (r_{js})^{n_j} \quad \text{for } s = 1, \ldots, t, \tag{4}$$

where $K$ is the normalizing constant

$$K = \sum_{s=1}^{t} \prod_{j=1}^{d} (r_{js})^{n_j}.$$

The joint distribution of any further set of $X$'s is
given by Equations 3 and 4 and conditional
independence given $p$. For example, the predictive
distribution of a single future $X$ is

$$\Pr\{X = x_j \mid X^*\} = \sum_{s=1}^{t} r_{js} \Pr\{p = r_s \mid X^*\}. \tag{5}$$
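Equations 4 and 5 translate into a few lines of code. The Python sketch below is illustrative rather than the authors' implementation: it builds the model space by brute force (feasible only for very small $d$ and $m$), assumes the uniform prior used above, and all function names are hypothetical.

```python
from itertools import product
from math import prod


def model_space(d, m):
    """All grid vectors r_s = (k_1, ..., k_d) / (m - 1) whose coordinates sum to 1."""
    return [tuple(k / (m - 1) for k in ks)
            for ks in product(range(m), repeat=d)
            if sum(ks) == m - 1]


def posterior(models, counts):
    """Equation 4 with a uniform prior: weight prod_j r_js**n_j, normalized by K."""
    weights = [prod(r ** n for r, n in zip(rs, counts)) for rs in models]
    K = sum(weights)
    return [w / K for w in weights]


def predictive(models, counts):
    """Equation 5: Pr{X = x_j | X*} = sum_s r_js * Pr{p = r_s | X*}."""
    post = posterior(models, counts)
    return [sum(rs[j] * w for rs, w in zip(models, post))
            for j in range(len(counts))]


# d = 2 possible values, grid {0, 1/2, 1}, one observation equal to x_1.
models = model_space(2, 3)
print(predictive(models, (1, 0)))
```

In this toy case the three model vectors are (0, 1), (1/2, 1/2) and (1, 0), with posterior weights 0, 1/3 and 2/3, so the predictive probabilities are 5/6 and 1/6. The cost of the sum over `models` is what grows so quickly with $d$ and $m$.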

3. FINITE SAMPLE CALCULATIONS

An essential part of our investigations has been
implementation of various discrete-finite procedures
as Fortran programs.


3.1. The Simplest Case

With n = 15 observations on a variable assuming
d = 10 different values, the amount of time to
calculate the predictive distribution of one future
observation on a VAX 11/750 for various grid sizes
m is given in Table 1. The estimated time referred
to in the last line of Table 1 is approximately equal
to one million years. This is not to suggest that
we should prepare to do calculations which require
this amount of time; rather, it suggests that serious
work is required to find ways to make inference
feasible in a reasonable amount of time. We
consider one direction in which this work may
proceed in Section 3.2 below. This approach is
through restrictions on the model vectors. Another
approach is to use parallel processing. We discuss
our efforts in this direction in Section 4 below.

Table 1: Times To Compute Predictive Distribution

  m    # of model vectors    Time (Seconds)
 10                 48620    29.53
 15                817190    431.5
 20               6906900    3529
300     $6 \times 10^{16}$    $3 \times 10^{13}$ (estimated)

The Carnegie-Mellon Statistics Department has an
attached processor which is optimized for floating
point arithmetic (a CSPI Mini-Map MM-211). This
processor is attached to a VAX 11/750 and is able
to perform floating point calculations at up to
twenty times the speed of the VAX without
degrading the VAX performance. We have also
implemented the program described above on this
processor. The results were, at best, disappointing.
Over a range of problem sizes we obtained a speed-up
(for this program) of only 25 to 60 percent,
compared to the VAX 11/750. This poor
performance is explained by the fact that the
attached processor is optimized for floating point
calculation and our program has very few such
calculations in it.


3.2. Smoothness

Consider the case in which each observable must
equal one of d equally spaced numbers in the
interval $[x_1, x_d]$. Assume $x_1 < x_2 < \ldots < x_d$. For
continuous probability models, it is common to
expect that $\Pr\{X \text{ near } x\}$ is close to
$\Pr\{X \text{ near } x'\}$ if $|x - x'|$ is small. This property is
the smoothness of the distribution of X. The
traditional method of guaranteeing smoothness is to
require the distribution of X to be a member of a
parametric family of smooth distributions.

The method we choose for anticipating
smoothness is to reduce the set of model vectors
$\{r_s\}$ by eliminating all those with adjacent
coordinates which are too far apart. This option
has the potential for reducing the computational
burden dramatically compared to the first approach
and is, thus, the option we will pursue in detail.
The reason is that the amount of time required to
calculate a predictive distribution is proportional to
the number of model vectors in the calculation.

There are several ways to implement the
elimination of "rough" model vectors. The simplest
is simply to specify some value $\epsilon$ and allow only
those vectors r with adjacent coordinates closer
than $\epsilon$. We have written some Fortran programs to
implement this simple smoothness criterion. For
example, consider the bound with $\epsilon = k/(m-1)$. That
is, we eliminate all vectors with $|r_{is} - r_{i+1,s}| > \epsilon$.
Table 2 below gives the times in seconds (on an
11/750) required to compute the predictive
distribution of a single future observation for
various values of n, d, m, and k. In the calculations
tabulated here, we made the added restriction that
$r_{is} > 0$ for all i and s.

Table 2: Times in Seconds To Compute
Predictive Distribution
With Smoothness Restrictions

  n     d     m     k    # of model vectors     Time
 15    10    20   ...                 46584       53
 15    10   ...   ...                617283      ...
 15    10    25   ...                   ...      ...
 15    10    30     3                   ...      ...
 15    10    30   ...               1856064     1502
 15    10    30   ...                   ...     2576
 15    10   ...   ...                   ...      ...
...    50    55   ...                   ...     1234
  1    50    56     1               2123274     9409
  1   100   104     1                161702     2156
  1   100   105     1               3921519    40732
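Because the running time is proportional to the number of model vectors, it pays to generate only the admissible vectors rather than to filter the full model space. The Python sketch below shows one way to do this, working in integer grid steps $k_i = (m-1)\,r_{is}$ so that the criterion reads $|k_i - k_{i+1}| \le k$. It is an assumption-laden illustration, not the Fortran program described in the text; the function name and the pruning rule are hypothetical.

```python
def smooth_vectors(d, m, k, positive=True):
    """Yield step vectors (k_1, ..., k_d) with sum m - 1 and |k_i - k_{i+1}| <= k.

    With positive=True every coordinate is at least one grid step, which
    corresponds to the restriction r_is > 0.
    """
    low = 1 if positive else 0

    def extend(prefix, remaining):
        slots = d - len(prefix)
        if slots == 0:
            if remaining == 0:
                yield tuple(prefix)
            return
        if prefix:
            lo = max(low, prefix[-1] - k)
            hi = min(remaining, prefix[-1] + k)
        else:
            lo, hi = low, remaining
        for step in range(lo, hi + 1):
            # Prune: the slots still to be filled need at least `low` each.
            if remaining - step < low * (slots - 1):
                break
            yield from extend(prefix + [step], remaining - step)

    yield from extend([], m - 1)


# Three coordinates on a grid with m = 7 points, adjacent steps differing by at most 1.
print(sorted(smooth_vectors(3, 7, 1)))
```

For this small case the admissible vectors are (1, 2, 3), (2, 2, 2) and (3, 2, 1), compared with 10 positive vectors when no smoothness bound is imposed.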


3.3.  Other  Approaches 

One alternative way to eliminate "rough" model
vectors is to consider only those vectors $r_s$ which
are unimodal in the sense that $\{i : r_{is} > c\}$ is a set
of consecutive integers for all c. We have also
programmed this unimodality criterion for
comparison. Table 3 below shows the times
required to calculate the predictive distribution of a
future observation based on a sample of size n = 15
from a distribution assuming d = 10 distinct values
for various values of m. The restriction that every
coordinate of p be greater than 0 was also imposed
in these calculations.
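The unimodality criterion can be tested straight from its definition. In the Python sketch below (an illustration with a hypothetical function name, not the authors' code), only thresholds c equal to a coordinate of the vector need to be examined, because the set of indices exceeding c changes only at those values.

```python
def is_unimodal(r):
    """True if {i : r[i] > c} is a run of consecutive indices for every threshold c."""
    for c in set(r):
        above = [i for i, value in enumerate(r) if value > c]
        # A non-empty index set is consecutive when its span equals its size.
        if above and above[-1] - above[0] + 1 != len(above):
            return False
    return True


print(is_unimodal((1, 2, 3, 2, 1)))   # a single peak
print(is_unimodal((1, 3, 2, 3, 1)))   # two peaks separated by a valley
```

The first vector passes and the second fails, since for c = 2 the indices above the threshold are 1 and 3, which are not consecutive.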

Table 3: Times in Seconds To Compute
Predictive Distribution
With Unimodality Restriction

  m    # of model vectors    Time
 20                   806       1
 30                 23028      24
 40                248912     231
 50               1604102    1406

One method for choosing between the various
alternatives is to supply some small set of
hypothetical data and produce the predictive
distribution for a future observation by each
alternative method. These distributions can be
easily plotted on a terminal screen allowing a user
to choose the method which produces the most
plausible predictive distribution. The chosen method
can then be applied to the real data set. Of course,
the amount of time required to perform the
necessary computations will also be an important
factor in choosing between alternatives.

As an illustration, we can compare the predictive
distributions produced by the models corresponding
to

1. the third row of Table 1,

2. the seventh row of Table 2, and

3. the fourth row of Table 3.

The results were all based on the same sample of
n = 15 observations of a variable with d = 10
possible values. The fifteen observations were 2, 3,
6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9. The predictive
distributions corresponding to the three tables,
respectively, are given in Table 4 below.

Table 4: Three Predictive Distributions
for 10 Possible Values

Value                        1    2    3    4    5    6    7    8    9   10
Table 1 (unrestricted)     .02  ...  .02  .21  ...  .17  .13  .02
Table 2 (smoothness)       .06  .08  .08  .06  .07  .15  .18  .15  .12  .06
Table 3 (unimodal)         .03  .04  .05  .06  .08  .20  .24  .18  .10  .04

It is worth noting that the predictive distribution
derived from the unimodal restriction is itself
unimodal, because most of the data values were
consecutive. The distribution derived under the
smoothness condition on adjacent values is
substantially smoother than the unrestricted
distribution, and it is flatter than the unimodal one.


4. PARALLEL COMPUTATION

An interesting feature of our initial programs is
that essentially all of the calculation is contained in
a single loop which is executed once for each
model vector. The result of one iteration through
the loop is independent of all other iterations
through the loop. This means that these calculations
are particularly amenable to parallel computation. In
particular, suppose there are L processors available
to perform the calculations. Let $W_s$ be the time
required by that part of the computation which
cannot be performed in parallel and let $W_p$ be the
time required by that part which can be performed
in parallel, so that the time required to calculate a
predictive distribution with one processor is

$$T_1 = W_s + W_p,$$

where $W_p$ is typically many orders of magnitude
larger than $W_s$. Obviously L processors can
complete the task in time

$$T_L = W_s + W_p/L + W_o,$$

where $W_o$ is the added time to handle the overhead
of the parallel processing. Hence, considerable
improvement in speed of execution can be expected
if L is large and $W_s$ and $W_o$ are small compared to
$W_p$. Unfortunately, no matter how large L is, the
ratio

$$T_L/T_1 > (W_s + W_o)/(W_s + W_p) > 0.$$
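The timing model is easy to explore numerically. The Python sketch below encodes the two expressions for $T_1$ and $T_L$ and the lower bound on their ratio; the function names and the sample values of $W_s$, $W_p$ and $W_o$ are hypothetical and chosen only to show the shape of the curve.

```python
def single_processor_time(w_s, w_p):
    """T_1 = W_s + W_p."""
    return w_s + w_p


def elapsed_time(n_proc, w_s, w_p, w_o):
    """T_L = W_s + W_p / L + W_o for L = n_proc processors."""
    return w_s + w_p / n_proc + w_o


def ratio_lower_bound(w_s, w_p, w_o):
    """The limit of T_L / T_1 as L grows: (W_s + W_o) / (W_s + W_p)."""
    return (w_s + w_o) / (w_s + w_p)


# Hypothetical workload: 10 s serial, 40000 s parallelizable, 50 s overhead.
w_s, w_p, w_o = 10.0, 40000.0, 50.0
t1 = single_processor_time(w_s, w_p)
for n_proc in (2, 8, 15, 1000):
    print(n_proc, elapsed_time(n_proc, w_s, w_p, w_o) / t1)
print("bound", ratio_lower_bound(w_s, w_p, w_o))
```

The printed ratios decrease toward the bound but never reach it, which is the point of the inequality: the serial part and the overhead set a floor on the achievable elapsed time.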


4.1.  A  Network  of  VAXes 

We have actually begun implementing some of
our algorithms on a system with truly parallel
architecture. The Department of Statistics at
Carnegie-Mellon University has, in addition to its
VAX 11/750, several Microvax I's and Microvax II's
which communicate via DECnet over Ethernet cables.
We have developed a set of FORTRAN subroutines
which allow us to create, on each of these
machines, processes which can communicate with
each other and divide up the work in an efficient
manner so as to dramatically reduce the elapsed
time required to perform the calculations described
in Section 3. In fact, the system is general enough
to be able to handle any problem in which the task
is decomposable into parts which can be performed
independently of each other.

The  system  we  are  using  works  as  follows. 
There  are  essentially  two  programs  and  there  are 
two  types  of  processes  on  the  network.  One 
process  will  be  called  the  parent,  while  the  others 
will  be  called  children.  One  of  the  two  programs  is 
run  by  the  parent,  and  the  other  program  is  run  by 
all  of  the  children  simultaneously.  Both  programs 
are  required  to  perform  interprocess  communication, 
which  will  be  described  in  more  detail  below.  The 
parent  program  divides  the  set  of  model  vectors 
into  groups  which  we  call  messages,  and  sends  the 
messages  to  the  children  via  the  communication 
network.  Each  message  consists  of  sufficient 
information  for  the  child  to  construct  the  model 
vectors  assigned  to  it  and  to  calculate  summands 
which  will  be  added  together  by  the  parent  to  obtain 
the  predictive  likelihood  and  the  predictive 
distribution  of  a  future  observation.  When  a  child 
finishes  its  work,  it  sends  its  calculations  back  to 
the  parent  which  combines  them  in  an  appropriate 
fashion. 

Making  sure  that  the  above  scheme  runs 
smoothly  requires  careful  attention  to  details. 
Because  our  system  consists  of  three  different  kinds 
of  VAX  processors,  timing  can  be  a  serious 
problem.  For  example,  a  Microvax  I  is  slower  than 
a  VAX  11/750,  which  is  slower  than  a  Microvax  II. 
It  makes  sense  to  assign  the  slower  machines  less 
work  so  that  all  children  finish  at  about  the  same 
time.  Since  our  system  is  in  constant  use  by  other 
researchers  in  the  Department  of  Statistics,  different 
processors  will  be  subject  to  different  demands  on 
resources  at  different  times.  In  order  to  avoid 
having  a  calculation  delayed  by  one  or  two  slow 
processors,  a  flexible  system  of  message 
distribution  is  required. 

We have chosen to divide the work into a large
number of messages and to send them to the
children as they are needed. That is, when a child
finishes a message and returns the results to the
parent, the parent then sends the child the next
available message. The goal of creating the
messages is to make them approximately the same
size while keeping the amount of effort required of
the parent to create each message relatively small.
The reason is that a child may have to wait while
the parent creates its next message. There are
advantages and disadvantages to having small
message sizes. One advantage of small sizes is
that a slow child (either a slow CPU or a busy
machine) will receive only a few messages, leaving
the bulk of the work to be performed by those
children who have the time and resources to do it.
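The effect of message size on load balance can be seen in a small simulation. The Python sketch below is a simplified model, not a description of the actual system: messages are of equal size, each child asks for the next message as soon as it finishes, and the parent's per-message cost is charged to the receiving child through the hypothetical `overhead` parameter. The relative speeds (1 for a fast machine, 0.2 for a slow one) are the approximate figures quoted in Section 4.3.

```python
import heapq


def simulate(total_work, n_messages, speeds, overhead=0.0):
    """Elapsed time when equal-sized messages are handed out on demand.

    speeds[i] is the work rate of child i; `overhead` is a fixed cost in
    seconds per message. Returns (elapsed_time, messages_per_child).
    """
    size = total_work / n_messages
    # Heap of (time at which the child becomes free, child index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    handled = [0] * len(speeds)
    finish = 0.0
    for _ in range(n_messages):
        t, i = heapq.heappop(free_at)      # next child to ask for work
        done = t + overhead + size / speeds[i]
        handled[i] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish, handled


speeds = [1.0, 1.0, 0.2]                   # two fast children and one slow child
print(simulate(2200.0, 3, speeds))         # one large message per child
print(simulate(2200.0, 220, speeds))       # many small messages
```

With one message per child the run lasts as long as the slow child needs for a third of the work. With 220 messages the slow child ends up handling 20 of them against 100 for each fast child, and the elapsed time drops to the ideal value of 1000 time units.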

An added advantage to partitioning the
computation into small parts is that the numbers
being added together are more nearly the same size
than with a serial algorithm. This increases
numerical accuracy in a manner similar to the
pairwise algorithm described by Chan, Golub, and
LeVeque (1983). See also Eddy and Jones (1985).
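The accuracy gain from partitioned sums can be reproduced on any machine by rounding every partial sum to single precision. The Python sketch below does this with the standard `struct` module; it uses IEEE single precision, which is not the VAX floating point format, and a toy sequence of equal summands, so it illustrates the mechanism rather than the paper's results. The function names are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def serial_sum32(values):
    """Left-to-right accumulation with every partial sum rounded to single precision."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc


def blocked_sum32(values, block):
    """Sum each block of `block` values separately, then add the block totals."""
    partials = [serial_sum32(values[i:i + block])
                for i in range(0, len(values), block)]
    return serial_sum32(partials)


values = [0.1] * 10000
print(serial_sum32(values))            # one long running sum
print(blocked_sum32(values, 100))      # blocks of about sqrt(len(values))
```

In the serial sum each small term is eventually added to a much larger accumulator and loses low-order bits; in the blocked sum the operands of every addition are of comparable size, so the result is much closer to 1000.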

On  the  negative  side,  small  message  sizes  can 
cause  a  communications  bottleneck  for  the  parent. 
In  order  to  make  optimal  use  of  resources,  we  have 
chosen  to  make  the  processor,  on  which  the  parent 
program  runs,  also  run  the  child  program.  That  is, 
one  computer  communicates  with  itself  over  DECnet 
as  if  it  were  a  remote  node.  In  order  for  this  child 
to  get  any  work  done,  the  parent  will  have  to  spend 
most  of  its  time  in  a  quiescent  state.  A  definite 
advantage  to  small  messages  appears  when  one 
considers  the  possibility  of  one  child  "dying" 
prematurely.  All  work  done  by  that  child,  since  it 
last  reported  results  to  the  parent,  is  lost  and  must 
be  redone.  Since  the  network  is  not  as  reliable  as 
one  would  wish,  and  systems  sometimes  crash  for 
unanticipated  reasons,  it  pays  to  have  small 
messages. 

One  drawback  to  our  system  occurs  when  the 
number  of  processors  becomes  very  large.  The 
bandwidth  of  the  communications  channel  will  only 
handle  a  certain  number  of  processors  before 
becoming  overloaded.  Also,  the  parent  must  have  a 
connection  to  every  other  processor,  which  can  tax 
the  resources  of  the  processor  on  which  the  parent 
runs.  This  problem  severely  limits  the  size  of  the 
system  in  theory,  but  few  organizations  own  enough 
stand-alone  systems  to  encounter  difficulty  due  to 
the bandwidth (for calculations of this kind).


4.2.  Description  of  the  System 

The  system  we  use  is  an  example  of  a  data  flow 
system  as  described  by  O'Leary  and  Stewart  (1985). 
The  concept  of  a  data  flow  algorithm  is  that,  once 
the  program  has  started,  the  flow  of  data  takes  care 
of  the  distribution  of  work  and  the  control  of  the 
program.  In  our  system,  after  each  child  begins 
work,  program  control  is  handled  by  the  return  of 
results  from  the  children  which  initiates  the 
subsequent  sending  of  the  next  message.  In  the 
paragraphs  below,  we  describe  in  detail  the  different 
features  of  the  network  of  VAXes. 

4.2.1.  Network  communication 

Communication  between  parent  and  child  is  done 
over  DECnet.  There  are  two  communication 
channels,  one  called  the  mailbox  channel  and  the 
other  called  the  data  channel.  The  mailbox  channel 
is  used  to  keep  track  of  the  network  status.  For 
example,  when  each  child  is  created,  the  parent 
sends  no  data  until  it  receives  a  message  in  the 
mailbox  saying  that  the  child  has  come  to  life. 
When  a  child  dies,  the  parent  gets  a  mailbox 
message  saying  the  child  is  dead,  and  the  parent 
must  reassign  the  task  on  which  the  child  was 
working,  if  there  was  one.  The  data  channel  is  what 
the  parent  uses  to  send  its  messages  (data)  to  the 
child  and  what  the  child  uses  to  return  its  results. 

4.2.2.  Asynchronous  communication 

The parent only deals with data arriving from a
child when it arrives. That is, the parent does not
wait for messages or data, but rather issues
asynchronous read requests and then goes on with
its work. For example, upon starting a child, the
parent opens the data and mailbox channels and
issues an asynchronous read for a mailbox message.
Then it goes to the next child and does the same
until it runs out of children or is interrupted by the
arrival of a mailbox message. When it reads a
mailbox message saying that the child is alive, it
issues an asynchronous read for a mailbox message,
sends data to the child, and then issues an
asynchronous read for data. When the parent reads
data returning from a child, it accumulates the
results, sends more data, and issues another
asynchronous read for more data. The child, on the
other hand, operates synchronously by reading data
from the parent, performing its computations,
writing the results back to the parent, and waiting
for the next set of data. The key to this system
working is that (i) the parent goes back to whatever
it was doing after it issues an asynchronous read
request and (ii) when an asynchronous read request
is answered, the parent is interrupted from what it
was doing and deals with what it reads. (The one
exception to this is that if the parent is already
reading the answer to an asynchronous read when
another one is also answered, the second and all
later ones queue up and are dealt with in order of
arrival.)

4.2.3.  The  parent  process 

Because the answers to asynchronous reads
interrupt the parent and begin execution of a
separate set of code, they behave like subprocesses.
In fact, the program flow following one of these
asynchronous answers is completely separate from
the basic parent program. The basic parent program
consists solely of the following:

1.  Initialize  with  input  data. 

2.  Loop  through  the  children  one  at  a  time. 

•  Open  a  link,  if  it  is  not  currently 
open. 

•  Issue  asynchronous  read  request  for 
mailbox  message. 

3.  Wait  some  fixed  amount  of  time. 

4.  Return  to  step  2. 

All  of  the  data  handling  is  done  by  the 
subprocesses  described  below.  Each  subprocess  is 
initiated  when  a  read  is  answered  by  one  of  the 
children.  Hence,  the  subprocess  is  associated  with 
that  child  for  the  duration  of  its  existence. 

4.2.4.  The  mailbox  subprocess 

The  first  thing  the  mailbox  subprocess  does  is 
check  to  see  if  the  message  is  a  birth  message  or  a 
death  message.  A  birth  message  means  that  the 
child  is  alive.  In  this  case,  data  is  sent  to  the  child 
and  an  asynchronous  read  on  the  data  channel  is 
issued  for  the  results.  A  death  message  means  that 
the  child  is  dead,  and  must  be  removed  from  the 
list  of  living  children.  In  addition,  if  the  child  had 
been  working  on  a  set  of  data,  the  data  must  be 
requeued  for  delivery  to  another  child  at  a  later 
time.  If,  on  the  other  hand,  the  child  is  dead 
because  the  parent  killed  it  (when  there  is  no  data 
left  to  be  sent),  we  need  only  remove  it  from  the 
list  of  living  children.  When  the  last  living  child 
dies  and  there  is  no  data  to  be  sent,  results  are 
summarized  and  execution  ceases. 


4.2.5.  The  data  subprocess 

The data subprocess begins when the
asynchronous read for the first set of results issued
by the mailbox subprocess is answered. At that
time, the results are accumulated. If there is no
more data, a special data set is sent to the child
which causes it to cease execution and send a death
message back to the parent on the mailbox channel.
If there is still data to be sent, the child receives
the next packet of data. The subprocess then issues
an asynchronous read for the results and quits.
Notice that this last part of the subprocess is
identical to part of the mailbox subprocess. In fact
these two subprocesses use the same code (which is
reentrant for this purpose).
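The control flow of Sections 4.2.1 to 4.2.5 can be mimicked on a single machine with threads and queues. The Python sketch below is an analogy under stated assumptions, not the DECnet and FORTRAN implementation: threads stand in for child processes, one incoming queue tagged with a channel name stands in for the parent's asynchronous reads, and all names (`child`, `parent`, `send_next`, `STOP`) are hypothetical. Children in this sketch never die unexpectedly, although the death handler still requeues any unfinished message.

```python
import queue
import threading

STOP = None  # stands in for the "special data set" that ends a child


def child(name, to_child, to_parent, work):
    """Synchronous child: announce birth, then read, compute, write until stopped."""
    to_parent.put(("mailbox", name, "birth"))
    while True:
        message = to_child.get()
        if message is STOP:
            break
        to_parent.put(("data", name, work(message)))
    to_parent.put(("mailbox", name, "death"))


def parent(messages, n_children, work):
    """Hand messages out on demand and accumulate the returned summands."""
    pending = list(messages)
    to_parent = queue.Queue()
    links, in_flight, living = {}, {}, set()
    for i in range(n_children):
        name = f"child{i}"
        links[name] = queue.Queue()
        threading.Thread(target=child,
                         args=(name, links[name], to_parent, work)).start()

    def send_next(name):
        # Shared by the birth handler and the data handler, as in the text.
        message = pending.pop() if pending else STOP
        in_flight[name] = message
        links[name].put(message)

    total, started = 0.0, 0
    while started < n_children or living:
        channel, name, payload = to_parent.get()
        if channel == "mailbox" and payload == "birth":
            started += 1
            living.add(name)
            send_next(name)
        elif channel == "mailbox":             # death message
            living.discard(name)
            lost = in_flight.pop(name, STOP)
            if lost is not STOP:
                pending.append(lost)           # requeue unfinished work
        else:                                  # results on the data channel
            total += payload
            send_next(name)
    return total


blocks = [range(start, start + 10) for start in range(0, 100, 10)]
print(parent(blocks, 3, lambda block: float(sum(x * x for x in block))))
```

Each message here is a block of integers and the "summand" is its sum of squares, so the accumulated total equals the sum of squares of 0 through 99 regardless of which child handled which block.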


4.3.  Empirical  Study  of  the  System 

Our initial investigations of the network of
VAXes have dealt with the questions of how much
improvement we get with more processors, and
how large each message should be to make the best
use of system resources. The example we used for
comparisons had n=14 observations on a variable
assuming d=19 values (0 to 9 in steps of .5) with a
model space having m=22 and a smoothness
criterion k=2 as described in Section 3.2. The total
number of model vectors is 38,226,040 and the
calculation of the predictive distribution of one
future observation took 11 hours and 8 minutes on a
single Microvax II (40,100 seconds).

We ran the same case under several different
conditions using the network of VAXes described
above. Since the 11/750 runs about 80% as fast and
the Microvax I's run about 20% as fast as the
Microvax II's, we constructed two systems. The
first system had 8 nodes: six Microvax II's, the
11/750, and one Microvax I. This is roughly
equivalent to seven Microvax II's. The second
system had 15 nodes: eight Microvax II's, the 11/750,
and six Microvax I's. This is roughly equivalent to
ten Microvax II's.

The  best  timing  obtained  for  the  15  node  system 
was  4,303  seconds,  and  the  best  timing  obtained  by 
the  8  node  system  was  5,771  seconds.  These 
numbers  are  10.7%  and  14.4%  of  the  time  taken  by  a 
single  Microvax  II.  This  is  remarkably  close  to  the 
best  we  could  have  expected,  namely  10%  and 
14.3%.  The  time  which  the  parent  spent  on  each 
message  varied  from  .13  to  .15  seconds  regardless 
of  which  system  was  used  or  how  many  messages 
were  sent. 
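The arithmetic behind these percentages is short enough to check by machine. The Python sketch below uses only the figures quoted in this section (relative speeds of 1, 0.8 and 0.2, and timings of 40,100, 5,771 and 4,303 seconds); the dictionary and function names are illustrative.

```python
# Speeds relative to a Microvax II, as quoted in the text.
SPEED = {"Microvax II": 1.0, "VAX 11/750": 0.8, "Microvax I": 0.2}


def equivalents(system):
    """Number of Microvax II's to which a mix of machines is roughly equivalent."""
    return sum(SPEED[kind] * count for kind, count in system.items())


eight_node = {"Microvax II": 6, "VAX 11/750": 1, "Microvax I": 1}
fifteen_node = {"Microvax II": 8, "VAX 11/750": 1, "Microvax I": 6}
single = 40100.0  # seconds on one Microvax II

for label, system, seconds in (("8 nodes", eight_node, 5771.0),
                               ("15 nodes", fifteen_node, 4303.0)):
    eq = equivalents(system)
    observed = 100.0 * seconds / single    # percent of single-machine time
    ideal = 100.0 / eq                     # best possible percent
    print(label, round(eq, 2), round(observed, 1), round(ideal, 1))
```

The two systems come out as equivalent to 7 and 10 Microvax II's, with observed fractions of 14.4% and 10.7% against ideal fractions of 14.3% and 10%.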

Figure  1  shows  the  total  elapsed  time  for  each 
system  for  several  different  sizes  of  message.  The 
horizontal axis is Log (Number of messages) rather
than message size. The pattern verifies our initial
conclusion  that  when  there  are  few  messages  (hence 
large  ones),  time  will  be  wasted  waiting  for  the  last 
slow  machine  to  finish  its  last  message.  And  when 
there  are  too  many  messages  (hence  very  small 
ones),  time  is  wasted  while  the  parent  deals  with  all 
of  the  asynchronous  read  requests  being  answered 
almost  immediately.  It  was  generally  true  that  the 
Microvax  I's  received  about  one-fifth  as  many 
messages  as  the  Microvax  ll's  when  there  were 
many  messages,  but  not  so  many  that  each  message 
was  returned  completed  immediately. 

Figure 1: Seconds vs. Ln(# of Messages)

Finally, we examined the numerical accuracy of
the computations. All of the calculations for the
cases plotted in Figure 1 were done in double
precision. We also ran the test case on the array
processor attached to the 11/750, which only does
single precision arithmetic. The computation took
35,036 seconds, and was incorrect by as much as
5% in some of the coordinates of the predictive
distribution. This was verified by running the case
serially in single precision on a Microvax II and
obtaining an identical posterior distribution (in 36,642
seconds) despite the fact that VAXes and the
attached array processor do not round identically
(cf. Eddy and Jones, 1985). The parallel algorithm,
however, was able to overcome the rounding error
problem. We ran the same case in parallel with
approximately the square root of the total number
of model vectors per message (6237 model vectors
per message in 6129 messages), but still in single
precision. The predictive distribution was identical
to the double precision result to five significant
digits. This case took 37,151 seconds of total CPU
time on Microvax II's. The message size was
chosen to be optimal for numerical accuracy, not
for speed.


4.4.  Monitoring  the  System 

In  order  to  get  the  most  out  of  the  network  of 
VAXes,  it  may  be  useful  to  monitor  the  network 
activity.  We  have  arranged  for  a  terminal  screen  to 
display  the  current  status  of  each  child  as  shown  in 
Figure  2.  One  could  display  any  information  which 
seemed  relevant  in  such  a  display.  We  have  chosen 
to  display  the  following: 

•  The  name  of  the  node  on  which  the  child 
is  running  and  the  time  of  the  most 
recent  message  on  the  first  line. 

•  Text  describing  the  most  recent  message 
on  the  second  line. 

•  Numbers of messages on the third line.
The first number counts all incoming and
outgoing messages to and from the
parent. The second number is the
number of data messages sent to that
particular child. The third number is the
total number of data messages sent to
all children so far. The number -1 in
either of these last two indicates that the
communication link between parent and
child is not open.

By use of such a monitoring device, one can see if
attention needs to be paid to a particular node
because the connection is not open. It also lets the
user know if some of the children are not doing
their fair share of the work. If the user knows how
many messages will be sent, it also indicates how
much of the problem has been completed.


5.  FUTURE  DIRECTIONS 

The  network  of  VAXes  can  actually  be  extended 
to  include  other  kinds  of  processors,  so  long  as  a 
communications  protocol  is  available.  In  addition, 
any  problem  which  is  decomposable  into  arbitrary 
sized  pieces  can  be  handled  in  parallel  using  the 
system.  We  explored  discrete-finite  inference  in 
detail  because  it  was  the  problem  which  first 
interested  us  in  the  system. 

Figure 2: Typical Terminal Screen While Monitoring Network Status

The most important thing still needed is a
thorough analysis of system performance. It is
straightforward to see that, if all processors
perform their calculations at fixed (albeit different)
rates, the elapsed time will be minimized by sending
only one message to each processor with sizes
being an increasing function of the rates. The
reason for this is that the overhead of sending
messages is nearly independent of the size of the
message. Of course, this might not be true for
other applications. Our system, however, is part of
a time sharing system, hence calculations are
performed at different rates by the same processor
at different times. Sending a single message to
each processor can be disastrous in this case if
one of the processors has considerably more other
work to do than the others. Evidence of this appears at
the left hand margin of Figure 1. We intend to
perform a probabilistic analysis of the system
performance, beginning with the case of
discrete-finite inference, but hopefully extending
to other types of problems as well.
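
The fixed-rate argument above can be illustrated with a small sketch. It assumes, beyond what the text states, that message sizes are made proportional to the processor rates (the simplest increasing function) and that the per-message overhead is a constant. The function names and all numbers are hypothetical.

```python
def allocate(total_items, rates):
    """Split total_items into one message per processor, sized in proportion to its rate."""
    total_rate = sum(rates)
    # Ideal (fractional) share of the work for each processor.
    ideal = [total_items * r / total_rate for r in rates]
    sizes = [int(x) for x in ideal]
    # Hand out the items lost to truncation, largest remainder first.
    leftover = total_items - sum(sizes)
    order = sorted(range(len(rates)), key=lambda i: ideal[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def elapsed(sizes, rates, overhead):
    """Elapsed time with one message per processor: the slowest processor finishes last."""
    return max(overhead + s / r for s, r in zip(sizes, rates))


rates = [1.0, 2.0, 3.0]          # hypothetical items per second
sizes = allocate(600, rates)     # proportional split
equal = [200, 200, 200]          # naive equal split, for comparison
print(sizes, elapsed(sizes, rates, 5.0), elapsed(equal, rates, 5.0))
```

With proportional sizes every processor finishes at the same moment. Under time sharing the rates drift, which is why a single large message per processor becomes risky.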


A BRIEF OVERVIEW OF COMPUTER ASSISTED MEDICAL DECISION MAKING


Richard  H.  Jones,  University  of  Colorado 


ABSTRACT 

Decision  trees  are  one  of  the  most  commonly 
used  tools  in  medical  decision  making.  This  talk 
briefly  explains  the  use  of  decision  trees,  how 
Bayesian  statistical  methods  are  used  to  convert 
prior  probabilities  before  a  test  into  posterior 
probabilities  after  a  test,  and  how  costs  or 
utilities  play  an  important  role.  As  an  example, 
a  decision  tree  for  a  patient  presenting  with 
jaundice  will  be  discussed.  Three  diagnostic 
tests  for  two  diseases  are  considered. 

1.  Introduction 

The  purpose  of  this  paper  is  to  provide  a 
brief  introduction  to  decision  trees,  as  used  in 
medical  decision  making,  for  statisticians  and 
computer  scientists  from  the  point  of  view  of  a 
statistician.  The  author  is  not  an  expert  in 
medical  decision  making,  but  has  been  working 
with  two  surgeons  at  the  University  of  Colorado, 
Ben  Eiseman,  M.D.  and  Brad  Borlase,  M.D.,  and  a 
Biometrics  graduate  student,  Maureen  Haschke,  to 
develop  small,  clinically  useful,  decision  trees 
for  use  in  surgical  practice.  Part  of  this 
effort  is  being  supported  by  Rose  Medical  Center, 
Denver. Much of the terminology used in medical
decision making is different from the usual
statistical terminology. A second purpose of this
paper is to explain the medical terminology in
statistical terms.

Decision trees have been used for medical
decision making for a number of years. A standard
reference on the subject is Weinstein and
Fineberg (1980). A recent book giving decision trees
for use in surgical practice is Norton and
Eiseman (1986). The Society for Medical Decision
Making  is  a  very  active  society  that  holds  an 
annual  meeting  and  publishes  the  journal  Medical 
Decision  Making. 

2.  An  Example 

Figure  1  shows  an  example  of  a  decision  tree 
for  use  in  diagnosing  a  patient  who  presents  at 
a surgical practice with jaundice. The two
possible diseases considered in this example are
gall stones (GS) and pancreatic cancer (PC). Two
initial  tests  are  ultrasound  (US)  and  CT  scan 
(CT).  A  more  invasive  test,  ERCP,  can  be  used  at 
a  later  stage  to  confirm  or  rule  out  a  diagnosis. 
Decision  nodes  are  shown  as  filled  in  squares, 
chance  nodes  as  open  diamonds,  and  terminal  nodes 
as  filled  in  rectangles  followed  by  an  expected 
cost.  Probabilities  are  associated  with  chance 
nodes. 

After obtaining the patient's history and doing
a physical examination, the physician enters
subjective probabilities (priors) for the various
possible outcomes. The table below shows an
example for the two disease case.

Table 1: Subjective prior probabilities for the
two disease case. (The bar denotes absence of
the disease.)

In  this  example  the  physician  feels  85  percent 
certain  that  this  particular  patient  has  disease 
1 and not disease 2. The probabilities at the
first chance nodes depend on the prior probabilities
and the sensitivities and specificities of
the test for each disease. This is shown
schematically in Figure 2 for the case of a single test
for a single disease. While this diagram and the
terms used are very common in the fields of
epidemiology and medicine, some translation may be
necessary for statisticians and computer
scientists who have not worked in these fields.

In  epidemiology,  the  prior  probability  of  a 
disease  is  referred  to  as  the  prevalence  of  the 
disease  in  the  population  being  considered.  This 
has  the  usual  frequency  interpretation  of  the 
proportion  of  people  in  the  population  with  the 
disease.  In  the  example  presented  here,  the 
prior  probabilities  are  the  physician's  subjective 
assessment  of  the  chances  that  a  certain  patient 
has the disease. In most medical decision trees,
the prevalence or frequency interpretation of
prior probabilities is used. Defining a population
as the people who come to a certain surgical
practice with jaundice, the prevalence of gall
stones is the proportion of these people with gall
stones. This prevalence is baseline information
for the physician. After taking the patient's
history and doing a physical examination, the
physician has much more information about this
patient. This information is much more subjective
since there is no longer a population, but a
unique person.

The  sensitivity  of  a  test  is  the  conditional 
probability  that  a  person  with  the  disease  will 
test  positive  for  the  disease.  The  specificity  of 
the  test  is  the  conditional  probability  that  a 
person without the disease will test negative. A
good test will have both the sensitivity and
specificity near 1, being both sensitive and specific
for the disease.

Sensitivity = $P(T^{+} \mid D)$ = TP / (TP + FN)
Specificity = $P(T^{-} \mid \bar{D})$ = TN / (TN + FP)

Figure 2. Two by two table of test results and
definitions of sensitivity and specificity.

The probabilities of the outcomes of a test are

\[
\begin{aligned}
P(T^{+}) &= P(D)\,P(T^{+} \mid D) + P(\bar{D})\,P(T^{+} \mid \bar{D}) \\
         &= P(D)\,Se + P(\bar{D})\,(1 - Sp), \\
P(T^{-}) &= P(\bar{D})\,P(T^{-} \mid \bar{D}) + P(D)\,P(T^{-} \mid D) \\
         &= P(\bar{D})\,Sp + P(D)\,(1 - Se),
\end{aligned}
\]

and these are assigned to a two outcome chance
node. Se and Sp denote the sensitivity and
specificity of the test. If the test is sensitive
and specific for two different diseases, and the
outcomes of the test for the two diseases are
statistically independent, a typical probability
would be

\[
\begin{aligned}
P(+-) ={}& P(D_1 D_2)\,Se_1\,(1 - Se_2) + P(D_1 \bar{D}_2)\,Se_1\,Sp_2 \\
        &+ P(\bar{D}_1 D_2)\,(1 - Sp_1)(1 - Se_2) + P(\bar{D}_1 \bar{D}_2)\,(1 - Sp_1)\,Sp_2.
\end{aligned}
\]
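
The chance node probabilities above can be computed directly. The sketch below is illustrative only: the function names are hypothetical, and apart from the 0.85 prior mentioned in the text, the joint prior and the test characteristics are made-up numbers. It also assumes, as the text does, that the outcomes of the test for the two diseases are independent.

```python
def outcome_probs(prior, se, sp):
    """Return (P(T+), P(T-)) for one test and one disease."""
    p_pos = prior * se + (1 - prior) * (1 - sp)
    p_neg = (1 - prior) * sp + prior * (1 - se)
    return p_pos, p_neg


def prob_pos_neg(joint, se1, sp1, se2, sp2):
    """P(+-): positive for disease 1 and negative for disease 2.

    joint maps (has_d1, has_d2) to the prior probability of that disease state.
    """
    total = 0.0
    for (d1, d2), p in joint.items():
        pos1 = se1 if d1 else 1 - sp1    # positive result for disease 1
        neg2 = 1 - se2 if d2 else sp2    # negative result for disease 2
        total += p * pos1 * neg2
    return total


# Hypothetical priors: 0.85 for "disease 1 and not disease 2", the rest invented.
joint = {(True, False): 0.85, (True, True): 0.05,
         (False, True): 0.05, (False, False): 0.05}
p_pos, p_neg = outcome_probs(0.5, 0.9, 0.8)
print(p_pos, p_neg, prob_pos_neg(joint, 0.9, 0.8, 0.7, 0.95))
```

The two outcome probabilities of a single test sum to 1, which is a quick check on any hand-entered sensitivities and specificities.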

Assumptions  of  statistical  independence  are  often 
questionable.  It  is  more  likely  that  results  of 
a  single  test  for  two  different  and  unrelated 
diseases  are  statistically  independent  than  that 
the  results  of  two  different  tests  for  the  same 
disease are statistically independent. The
results of two different tests can be influenced
by the stage of the disease. As the tree is
traversed, both forms of statistical independence
are assumed. The only way to avoid these
assumptions is to obtain joint sensitivity and
specificity data for multiple tests and multiple
diseases.

The next step is called probability revision
by medical decision makers. This is the
application of Bayes' Rule to obtain posterior
probabilities after a test is carried out. The usual
medical or epidemiological terminology for
posterior probabilities is predictive value of a test:
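
In the notation used for the chance node probabilities, Bayes' Rule gives the positive and negative predictive values as follows. The symbols PV+ and PV- are labels assumed here for illustration. The count forms on the right use the TP, FP, TN, and FN cells of the two by two table and hold when the prior equals the prevalence in that table.

```latex
\[
\begin{aligned}
PV^{+} &= P(D \mid T^{+})
        = \frac{P(D)\,Se}{P(D)\,Se + P(\bar{D})\,(1 - Sp)}
        = \frac{TP}{TP + FP}, \\
PV^{-} &= P(\bar{D} \mid T^{-})
        = \frac{P(\bar{D})\,Sp}{P(\bar{D})\,Sp + P(D)\,(1 - Se)}
        = \frac{TN}{TN + FN}.
\end{aligned}
\]
```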


Referring  to  the  two  by  two  table  in  Figure  2, 
predictive  values  are  conditional  probabilities 
obtained  by  normalizing  across  rows  while  the 
sensitivity and specificity are conditional
probabilities by column. For one test and two
diseases, a similar four by four table can be
constructed.

If the tree is traversed forward, probabilities
can be assigned to all chance nodes by
recursive application of Bayes' Rule. Today's
posterior becomes tomorrow's prior.
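
The recursive revision can be sketched as below. This is a minimal illustration under the independence assumptions discussed earlier, not the SMLTREE implementation; the function names and test characteristics are hypothetical.

```python
def revise(prior, se, sp, positive):
    """Posterior probability of disease after one test result (Bayes' Rule)."""
    if positive:
        num = prior * se
        den = num + (1 - prior) * (1 - sp)
    else:
        num = prior * (1 - se)
        den = num + (1 - prior) * sp
    return num / den


def revise_sequence(prior, results):
    """Apply revise() to each (se, sp, positive) in turn.

    The posterior from one test is used as the prior for the next, which
    assumes the tests are conditionally independent given disease status.
    """
    p = prior
    for se, sp, positive in results:
        p = revise(p, se, sp, positive)
    return p


first = (0.9, 0.8, True)      # hypothetical test: Se 0.9, Sp 0.8, positive result
second = (0.7, 0.95, False)   # hypothetical test: Se 0.7, Sp 0.95, negative result
print(revise(0.5, *first), revise_sequence(0.3, [first, second]))
```

Under the independence assumption the final posterior does not depend on the order in which the test results are applied.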

Costs are assigned to the terminal nodes; these
include: 1) costs of tests, 2) costs of
hospitalization, 3) cost of incorrect diagnosis. The
last is an expected cost since it involves the
probabilities of various outcomes.

The tree is then folded back to obtain the
expected cost at each node. At a decision node,
the branch with the minimum expected cost is used.
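
Folding back can be sketched as a short recursion. The dictionary layout, the function name, and all probabilities and costs below are hypothetical and chosen only to show the averaging at chance nodes and the minimization at decision nodes.

```python
def fold_back(node):
    """Return (expected_cost, best_branch_label) for a node of the tree."""
    kind = node["kind"]
    if kind == "terminal":
        return node["cost"], None
    if kind == "chance":
        # Probability-weighted average of the folded-back branch costs.
        cost = sum(p * fold_back(child)[0] for p, child in node["branches"])
        return cost, None
    # Decision node: choose the branch with the minimum expected cost.
    options = {label: fold_back(child)[0] for label, child in node["branches"]}
    best = min(options, key=options.get)
    return options[best], best


def terminal(cost):
    return {"kind": "terminal", "cost": cost}


tree = {
    "kind": "decision",
    "branches": [
        ("US", {"kind": "chance",
                "branches": [(0.7, terminal(100)), (0.3, terminal(500))]}),
        ("CT", {"kind": "chance",
                "branches": [(0.9, terminal(250)), (0.1, terminal(400))]}),
    ],
}
cost, choice = fold_back(tree)
print(choice, cost)
```

In this invented example the ultrasound branch has the lower expected cost, so it is the branch selected at the root.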

The  decision  tree  in  Figure  1  was  developed 
using  the  software  package  SMLTREE.  It  has  been 
decided  to  program  this  type  of  tree  in  C.  The 
goal  is  to  develop  a  number  of  relatively  small 
special  purpose  decision  trees  such  as  this  for 
use  in  clinical  practice.  The  actual  tree  will 
be  hidden  from  the  physician's  view.  A  series  of 
screens will display only the information
relevant to the physician.


In learning medical diagnosis,
medical students learn typical
presentations of diseases, but there is
little formal learning about what
weight should be given to diagnostic
information in making a diagnosis or in
distinguishing one disease from
another. Furthermore, although they
may learn the outcome of a case,
students are rarely given feedback on how
well they have combined clinical
information in making diagnostic or
therapeutic judgements.

On  the  other  hand,  Hammond  has  shown 
that  learners  who  were  given  feedback 
on  how  they  appeared  to  have  weighted 
information  in  making  judgements 
learned  much  more  effectively  than 
those  given  only  the  correct  outcome. 
(1) 

To  test  whether  this  kind  of 
feedback  would  have  its  predicted 
effect  on  the  learning  of  medical 
diagnosis,  we  developed  a  microcomputer 
program  to  present  simulated  cases, 
obtain  students'  judgements  for  each 
case  and  then,  after  several  cases, 
display the student's apparent
weighting of information along with the
correct weighting. Initial trials have
indicated that this model is effective
in teaching diagnostic relationships.

In  this  paper,  I  will  describe  two 
applications  of  this  program  to  medical 
decision  making. 

Design  of  Microcomputer  Program 

The microcomputer program operates
by presenting simulated clinical cases
and asking for a diagnostic or
therapeutic judgement from the physician or
student in the form of an interval
scaled measure. After each case, the
student is given outcome feedback in
the form of a score or probability
calculated from a linear model. These
models have either been derived
empirically from analysis of large
clinical populations or, where
feasible, from published rules which have
been validated on other populations.

The  cases  are  constructed  by  loading 
descriptions  of  various  levels  of 
severity  of  each  of  the  variables  and 
using  a  fractional  factorial  design  to 
determine  the  levels  for  each  case. 

After  each  series  of  simulated 
cases,  the  relationship  between  the 
clinical  variables  and  the  judgements 
made  is  calculated  using  dummy  variable 
regression  analysis  and  this  result  is 
presented  to  the  student  as  a  bar  graph 
comparing the apparent weighting on the
previous cases to that recommended by
the model. Weights are expressed as
percentage of total weight.
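
The weighting calculation can be sketched as follows. This is a simplified illustration, not the program described here: it assumes two-level variables and a full (rather than fractional) factorial design, in which the main-effect coefficient of each dummy variable equals the difference between the mean judgement with the finding present and with it absent. The function name and the judgements are hypothetical.

```python
from itertools import product


def apparent_weights(cases, judgements):
    """Relative weight of each variable, as a percentage of total weight.

    cases is a list of tuples of 0/1 levels from a balanced, orthogonal
    design; judgements holds the corresponding numeric judgements.
    """
    n_vars = len(cases[0])
    effects = []
    for j in range(n_vars):
        present = [y for x, y in zip(cases, judgements) if x[j] == 1]
        absent = [y for x, y in zip(cases, judgements) if x[j] == 0]
        effects.append(sum(present) / len(present) - sum(absent) / len(absent))
    total = sum(abs(e) for e in effects)
    return [100 * abs(e) / total for e in effects]


cases = list(product([0, 1], repeat=3))                    # 8 simulated cases
judgements = [10 + 30 * a + 10 * b for a, b, c in cases]   # third variable ignored
weights = apparent_weights(cases, judgements)
print(weights)
```

The resulting percentages are what a bar graph of apparent weighting would display next to the weights recommended by the model.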

Both  outcome  feedback  and  feedback 
of  weighting  can  be  turned  on  or  off 
under  program  control  from  any  of  15 
possible  outcome  measures.  This  allows 
investigation  of  diagnostic  and 
treatment decisions as well as
independent predictions of the likelihood of
various  diseases  given  the  same  cases 
(differential  diagnosis). 

Application  to  Medical  Diagnosis 

We  first  applied  the  method  to 
examining  the  apparent  weighting  of 
clinical  information  by  physicians  in 
diagnosing  pulmonary  embolus. (2)  (We 
refer  to  this  as  "apparent"  weighting 
because  it  is  not  known  whether  a 
linear  regression  model  is  at  all 
similar  to  how  people  use  information 
in  making  such  judgements.)  The 
initial  studies  showed  that  physicians 
used  these  clinical  factors  in  highly 
variable ways and showed great
heterogeneity in apparent weighting. The
average  weights  were  also  different 
from  those  derived  from  analysis  of 
actual  cases. 

Diagnosing  Urinary  Tract  Infection 

The first application of the
interactive features of the program tested
whether medical students would learn
the diagnosis of urinary tract
infection more effectively if given feedback
of weighting. (3)

The  reference  model  was  derived  from 
analysis  of  records  of  228  patients 
suspected  of  urinary  tract  infection, 
seen  in  the  Emergency  Department  of  the 
University  of  Nebraska  Hospital.  This 
was  later  expanded  to  750  patients.  A 
five  item  rule  was  derived  initially 
using  discriminant  analysis  with  the 
urine  culture  results  as  the  outcome 
variable.  This  rule  predicted  the 
correct  outcome  in  80%  of  patients  in  a 
subsequent  validation  set. (4) 

The  variables  were  defined  for  each 
of  three  levels  and  the  cases  were 
displayed  according  to  an  underlying 
fractional factorial design with 18
iterations. (5)

After  viewing  each  case,  the  student 
estimated  the  likelihood  the  urine 
culture  would  be  positive  and  whether 
they  would  begin  antibiotic  therapy. 
After  each  case,  they  received  outcome 
feedback  in  the  form  of  the  probability 
calculated  from  the  rule  derived  from 
the  actual  cases. 

After each set of 18 cases, the
students  were  given  a  display  of  their 
weighting  compared  with  that  of  the 
model  in  the  form  of  bar  graphs.  After 
viewing  the  display  and  determining  how 
his  apparent  weighting  differed  from 
the  model,  the  student  would  proceed  on 
to  18  new  cases.  Each  student  in  this 
study  completed  3  sets  of  18  cases. 

We  compared  learning  for  students 
who  received  this  feedback  with  those 
who  did  not.  Second  year  students  were 
allocated  to  two  groups.  A  control 
group  received  only  the  calculated 
probability  after  each  case  and  the 
experimental  group  received  the  graphic 
display  of  weighting. 

Both  groups  began  the  study  with  an 
average  correlation  between  their 
probability  estimates  and  those 
calculated  from  the  model  of  .55.  The 
experimental  group  learned  more  rapidly 
through  the  three  sets,  achieving  a 
correlation of .80, while the control
group showed some learning but achieved a
correlation of only .67 at the end.

The  improvement  was  accompanied  in  the 
experimental  group  by  convergence  of 
their  average  weighting  on  the  weights 
used  in  the  model. 

Diagnosing  Streptococcal  Pharyngitis 

In  a  second  experiment,  we  looked  at 
the  effect  of  these  types  of  feedback 
on  the  calibration  of  predictions  as 
well  as  how  well  the  model's  weighting 
was learned. Poses and colleagues had
studied how 11 student health
physicians at the University of Pennsylvania
used clinical information in diagnosing
and treating streptococcal
pharyngitis. (6) They found these experienced
physicians greatly overestimated the
probability of streptococcal
pharyngitis in their cases. The apparent
strategies of these physicians had been
extensively studied. Logistic
regression was used to model the relationship
of clinical findings to the predicted
and actual culture outcome. We decided
to ask these same physicians to test
the effect of computer feedback of
weighting on their subsequent
diagnostic performance.

A  well  validated  decision  rule  for 
predicting  streptococcal  pharyngitis 
had  been  described  by  Centor  (7)  and 
was  used  as  the  model  for  the  learning 
exercise.  The  rule  gives  equal  weight 
to  4  items:  fever,  absence  of  cough, 
tonsillar  exudate  and  enlarged  anterior 
cervical  lymph  nodes.  Three  additional 
variables,  important  to  the  physicians, 
were  added  but  given  no  weight  in  the 
rule. 
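
The equal weighting can be sketched as a simple count. The identifier names are hypothetical, and the mapping from score to probability is not given here, so the sketch stops at the score. Any finding outside the four items, such as the made-up "hoarseness" key, contributes nothing.

```python
# The four equally weighted items named in the text (identifiers are assumed).
CENTOR_ITEMS = (
    "fever",
    "absence_of_cough",
    "tonsillar_exudate",
    "enlarged_anterior_cervical_nodes",
)


def centor_score(findings):
    """Count how many of the four items are present; other keys are ignored."""
    return sum(1 for item in CENTOR_ITEMS if findings.get(item, False))


example = {"fever": True, "tonsillar_exudate": True, "hoarseness": True}
print(centor_score(example))
```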

Simulated  cases  were  constructed 
from  a  fractional  factorial  design  with 
2  levels  and  12  cases.  Each  case 
represented  each  of  the  variables  as 
either  present  or  absent. 


Outcome  feedback  was  again  given  as 
calculated  probability.  After  each  12 
cases,  the  weighting  calculated  from 
the  answers  was  compared  with  that 
suggested  by  the  rule. 

The  study  consisted  of  12  cases  with 
no  feedback  as  a  baseline  measure 
followed  by  a  lecture  explaining  the 
rule  and  its  rationale. 

At  one,  two,  and  six  months  later, 
there  were  paired  sessions  of  12  cases 
each.  Each  of  the  three  pairs  of 
sessions  consisted  of  12  cases  with 
outcome  feedback  after  each  case 
followed  by  lens  model  feedback.  This 
was  followed  by  12  more  cases  also  with 
feedback. 

As  in  the  previous  study,  the 
weights  calculated  from  the  judgements 
made  by  the  physicians  began  to 
converge  on  the  model  weights  as  the 
study  progressed.  Correlation  of  the 
physicians'  predictions  with  the 
likelihood calculated by the rule rose
rapidly  at  first  then  continued  a  slow 
increase  after  two  months.  The  mean 
probability  estimates  corrected 
rapidly.  The  decision  rule  had  been 
corrected  for  the  5%  prevalence  of 
positive  cultures  in  the  student  health 
population  and  the  mean  for  the 
simulated  cases  was  6.5%.  At  first, 
there  was  considerable  overestimation 
(24%) but the correct probability was
reached  after  the  first  month. 

A  measure  of  calibration  is  the 
regression  of  the  estimated  probability 
on  the  actual  probability;  with  perfect 
calibration  falling  on  a  diagonal  line 
with  a  slope  of  1  and  an  intercept  of 
0.  Thus,  if  calibration  improves,  the 
slope  approaches  1  and  the  intercept  0. 
In  the  early  cases  of  this  study,  the 
intercept began at 2.7 and
progressively declined to equal 1.05 at the
last session. Similarly, the slope
began at .17 and was .98 at the last
session.
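
The calibration measure described above is an ordinary least-squares line of estimated on actual probability. A minimal sketch follows; the function name and the data are hypothetical and are not the study's values.

```python
def calibration_line(actual, estimated):
    """Least-squares slope and intercept of estimated on actual probability."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(estimated) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, estimated))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


actual = [0.1, 0.2, 0.4]
overestimates = [0.3 + 0.5 * x for x in actual]   # too high and too flat
print(calibration_line(actual, actual))            # perfect calibration
print(calibration_line(actual, overestimates))
```

A flat slope with a large intercept corresponds to estimates that are uniformly too high and insensitive to the findings.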

Both  the  lecture  and  lens  model 
feedback  produced  changes  in  the 
appropriate  direction  and  the  changes 
persisted  over  6  months.  The  program 
produced  a  rapid  change  in  the  mean 
probability  estimates  and  calibration 
continued  to  improve.  These  changes 
occurred  in  simulated  cases,  but  recent 
studies  of  these  physicians  after  this 
intervention  indicate  they  became  more 
accurate  and  better  calibrated  in  their 
real  life  predictions. 

Thus,  although  these  initial 
applications  are  quite  limited  in 
scope,  the  feedback  of  diagnostic 
weighting  using  simulated  cases  appears 
very  promising  in  improving  physicians' 
diagnostic  and  therapeutic  predictions. 


ABSTRACT

Over  the  last  several  years,  teams  working  on  expert 
systems  have  been  exploring  formal  approaches  for  belief 
revision  and  information  acquisition.  The  formalization  of 
major  components  of  expert  systems  operation  is  useful  for 
understanding  and  characterizing  system  behavior  and  for 
predicting  changes  with  modification.  Formalization  also 
facilitates  the  involvement  of  investigators  in  more  well- 
developed  disciplines  such  as  statistics.  While  the  use  of 
formal  methodologies  for  diagnostic  problem  solving  is 
attractive  because  of  the  generality,  power,  and  axiomatic 
basis  of  inference,  the  methodologies  have  been  criticized 
for  making  inferences  that  are  difficult  to  understand  and 
explain.  I  shall  focus  on  the  problem  of  explaining  formal 
reasoning  methodologies.  The  PATHFINDER  system  for 
pathology  diagnosis  is  presented  as  an  example  of  current 
research  on  aspects  of  the  use  of  formal  methodologies  in 
expert  systems.  I  will  demonstrate  that  a  formal  system  is 
amenable  to  controlled  degradation  to  enhance  its 
explanation  capability. 


1.  INTRODUCTION 

It is fitting that there be a focus of discussion on expert
systems in a session on computers and medical
decision-making. Original ground-breaking research on expert
systems was the result of attempts to build systems to
reason about complex medical problems [4]. Expert
systems  research  developed  within  the  field  of  artificial 
intelligence  over  a  decade  ago  and  is  now  an  established 
engineering  sub-discipline  of  artificial  intelligence.  It  is 
the  intent  of  expert  system  research  to  develop 
methodologies  for  the  representation  and  manipulation  of 
the  knowledge  of  experts  in  a  variety  of  disciplines. 

Artificial  intelligence  research  is  still  in  its  youth.  As  in 
other  new  disciplines  in  which  unifying  theories  have  not 
been  developed,  much  work  has  focused  on  non-axiomatic, 
descriptive  models.  In  this  paper,  I  would  like  to  briefly 
introduce  the  descriptive  and  formal  approaches  to  research 
in  artificial  intelligence  in  general.  I  will  stress  the 
usefulness  of  reasoning  methodologies  that  follow  from  a 
set  of  well-characterized  axioms.  I  will  then  introduce 
current  problems  with  the  use  of  formal  systems.  One 
frequent  criticism  of  formal  reasoning  strategies  is  that  they 
are  difficult  to  understand  and  explain.  I  will  focus  on  the 
problem  of  explanation  in  expert  systems  that  use  formal 
methods  for  reasoning  under  uncertainty.  In  this  regard,  I 
will  present  research  on  the  PATHFINDER  expert  system 
for  pathology  diagnosis  as  an  example  of  research  on 
aspects  of  the  use  of  formal  methodologies  in  expert 
systems.  In  answer  to  some  complaints  about  the  rigidity 
and  unnatural  nature  of  formal  systems,  I  shall  describe 
how  a  formal  system  is  amenable  to  controlled  degradation 
so  that  it  can  perform  more  descriptively. 


2.  AXIOMATIC  AND  DESCRIPTIVE  APPROACHES 

Science  has  been  marked  by  an  ongoing  attempt  to 
explain  observed  patterns  and  relationships  with  models  that 
provide  reasonable  explanations  and  predictability.  Useful 
theories  tend  to  simplify  phenomena  through  explaining 
complexity  with  a  relatively  small  number  of  empirically  or 
intuitively  justifiable  properties  or  axioms. 

Unfortunately,  theories  based  on  a  set  of  justifiable 
axioms  often  do  not  exist;  when  a  theory  is  enumerated,  it 
is  often  not  obviously  optimal,  unique,  or  desirable. 
Throughout  the  history  of  science,  when  useful  axiomatic 
theories  have  not  been  available,  scientists  have  resorted  to 
descriptive  models.  Such  models  summarize  complex 


behavior  by  describing  phenomenology  without  resorting  to 
fundamental  axioms.  They  capture  the  behavior  of  systems, 
often  through  the  postulation  of  relations  that  may  be 
inconsistent  with  one  another  or  with  other  accepted 
knowledge.  As  an  example,  before  Newton  constructed  the 
theory  of  universal  gravitation  and  Kepler  developed 
equations  describing  the  motion  of  objects  orbiting  in 
gravitational  fields,  astronomers  often  depended  on  epicycle 
machines.  These  machines  could  approximately  describe 
the  movement  of  heavenly  bodies,  as  viewed  from  the  earth, 
with  a  complex  tangle  of  gears  and  chains.  They  did  not 
explain  the  movement  of  heavenly  bodies  with  a  consistent 
theory  of  fundamental  relationships. 

2.1  Descriptive  Expert  Systems  Research 

Much  of  expert  systems  research  can  be  characterized  as 
either  axiomatic  or  descriptive.  The  descriptive  expert 
system  approach  centers  on  the  design  and  empirical 
evaluation  of  algorithms  that  mimic  aspects  of  human 
behavior.  Descriptive  expert  systems  research  is  not 
hindered  by  the  lack  of  a  formal  axiomatic  basis;  it  is  the 
intent  of  the  research  to  discover  useful  strategies  for 
representing  and  manipulating  expert  knowledge  regardless 
of  the  availability  or  acceptability  of  a  set  of  self- 
consistent  desiderata.  Investigators  in  the  descriptive  school 
of  research  view  exploration  of  the  sufficiency  of  informal 
models  of  human  problem  solving  as  a  more  direct 
approach  to  difficult  problems.  That  is,  given  poor 
understanding,  many  expert  systems  researchers  attempt  to 
capture  expertise  through  building  and  experimenting  with 
descriptive  models  in  the  spirit  of  the  epicycle  machines  of 
long  ago. 

As  an  example  of  the  descriptive  approach  to  expert 
system  design,  the  Present  Illness  Program  (PIP)  [23], 
developed  ten  years  ago  at  M.I.T.,  was  an  attempt  to 
simulate  the  cognition  of  a  physician's  reasoning  about 
patients  presenting  with  edema  (swelling).  A  central  aspect 
of  the  design  of  the  system  involved  an  analysis  of  the 
behavior of the clinician. In final versions of PIP,
descriptive cognitive structures called the supervisory
program, the short-term memory, and the long-term memory
were constructed.

A  large  category  of  descriptive  systems  is  based  on  the 
rule-based  methodology  [4].  The  rule-based  expert  system 
methodology  is  the  result  of  attempts  to  adapt  the  use  of  an 
automated  logical  inference  methodology,  called  production 
systems  [32,  7],  to  capture  aspects  of  human  expertise. 
Production  systems  are  comprised  of  sets  of  logically 
interacting  inference  rules  of  the  form  IF  E  THEN  H, 
where  H  is  a  hypothesis  and  E  is  evidence  having  relevance 
to  the  hypothesis.  In  practice,  rules  of  logical  inference  are 
used  in  automated  deduction.  For  example,  modus  ponens 
and  simple  rules  of  unification  can  be  applied  to  a  set  or 
knowledge  base  of  rules  to  do  proofs  that  consist  of  the 
forward  or  backward  "chaining"  of  rules. 
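
Forward chaining over IF E THEN H rules can be sketched as below. This is a propositional simplification with no unification, and the rule contents are placeholders. It is meant only to show repeated application of modus ponens until no new hypotheses can be derived.

```python
def forward_chain(rules, facts):
    """Derive every hypothesis reachable from the given facts.

    rules is a list of (evidence, hypothesis) pairs, where evidence is a set
    of propositions that must all be known before the hypothesis is added.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for evidence, hypothesis in rules:
            if hypothesis not in known and evidence <= known:
                known.add(hypothesis)   # modus ponens: E holds, so conclude H
                changed = True
    return known


rules = [({"A"}, "B"), ({"B", "C"}, "D")]   # placeholder rules
print(sorted(forward_chain(rules, {"A", "C"})))
```

Backward chaining would instead start from a goal hypothesis and work back through the rules to the evidence needed to establish it.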

One  of  the  most  prolific  early  expert  systems  was 
MYCIN  [31],  a  rule-based  expert  system  for  the  diagnosis 
of  bacterial  infection.  The  MYCIN  reasoning  framework 
remains  one  of  the  most  popular  expert  system 
methodologies.  MYCIN's  knowledge  is  stored  as  rules  that 
capture  the  relationships  among  relevant  medical  evidence 
and  hypotheses.  For  example,  a  rule  in  MYCIN  might  be: 
"if  an  organism  infecting  a  patient  is  gram-positive  and 
grows  in  clumps  then  add  support  to  the  hypothesis  that  the 
organism  is  staphylococcus."  It  was  recognized  early  on  in 
the  MYCIN  research  that  straightforward  application  of  the 
production  rule  methodology  would  be  insufficient  because 
of the uncertainty in the relationships between evidence and
hypotheses in medicine.

In order to accommodate these non-deterministic
relationships, MYCIN uses certainty factors [4]. To each
rule, a certainty factor is attached which represents the
change in belief about a hypothesis given some evidence.
Certainty factors range between -1 and 1. Positive numbers
correspond to an increase in belief in a hypothesis while
negative quantities correspond to a decrease in belief. An
ad hoc calculus for evidence combination was presented in
the original research [30].
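
The combination calculus itself is not reproduced in this paper. As an assumption for illustration, the sketch below uses the combining function as it is commonly presented in later descriptions of the MYCIN and EMYCIN certainty factor model, where two certainty factors bearing on the same hypothesis are merged into one. The function name is hypothetical.

```python
def combine_cf(x, y):
    """Combine two certainty factors in [-1, 1] for the same hypothesis.

    Assumed form (commonly cited, not taken from this paper). It is
    undefined when one factor is exactly 1 and the other exactly -1.
    """
    if x >= 0 and y >= 0:
        return x + y * (1 - x)          # two confirming pieces of evidence
    if x < 0 and y < 0:
        return x + y * (1 + x)          # two disconfirming pieces of evidence
    return (x + y) / (1 - min(abs(x), abs(y)))   # conflicting evidence


print(combine_cf(0.6, 0.4), combine_cf(-0.6, -0.4), combine_cf(0.6, -0.4))
```

Under this form the result stays within the range -1 to 1 and does not depend on the order in which the two factors are combined.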

2.2  The  Axiomatic  Approach 

In  contrast  to  the  descriptive  approach,  investigators 
pursuing  the  formal  axiomatic  approach  are  interested  in 
exploring  the  adequacy  of  systems  that  satisfy  desired 
properties.  That  is,  they  design  expert  systems  that  are 
necessarily  consistent  with  desired  properties.  When  such  a 
set  is  deemed  optimal  for  reasoning  in  the  context  of 
particular  tasks  it  is  termed  a  normative  theory  for 
reasoning. 

Investigators  interested  in  the  formal  approach  attempt  to 
design  expert  systems  that  behave  consistently  with 
established  theories  for  reasoning  under  uncertainty.  In 
exploring  the  automation  of  reasoning  under  uncertainty, 
investigators  have  focused  on  the  use  of  theories  for  the 
consistent  revision  of  belief  in  the  context  of  previous 
belief  and  for  controlling  information  acquisition.  Examples 
of  axiomatic  theories  that  have  been  used  in  expert  systems 
research  for  belief  revision  include  probability  [24],  fuzzy 
logic  [39],  Dempster-Shafer  theory  [28],  certainty  factors 
[30], and multi-valued logics [13]. Theories used for
controlling information acquisition include information
theory [29] and decision theory [25, 26].

Alternative  formalisms  are  often  based  on  clear  sets  of 
properties.  An  expert  system  engineer  can  base  an  expert 
system  on  a  set  of  properties  that  is  viewed  to  be  a 
particularly  intuitive  or  desired  For  example,  a  set  of 
simple  properties  about  continuous  measures  of  belief  can 
be  shown  to  necessitate  the  use  of  probability  theory  to 
manage  the  consistent  assignment  of  belief  [6,  36,  20]. 
Agreement  with  the  properties  necessitates  the  use  of 
probability  theory.  A  small  set  of  intuitive  properties  also 
lies at the foundation of decision theory [37]. Of course,
there  are  differences  of  opinion  among  the  formalists  about 
the  optimality  or  necessity  of  particular  sets  of  axioms. 
For  example,  there  has  been  ongoing  debate  in  the  artificial 
intelligence  community  regarding  the  alternative 
methodologies for the revision of belief [5, 20].

To  date,  there  have  been  several  attempts  to  base  expert 
reasoning  systems  on  well-defined  formalisms.  Three 
examples  are  the  Acute  Renal  Failure  [15]  system,  the 
MEDAS [1] system for emergency medicine, and the
PATHFINDER  [17]  system  for  lymphoma  diagnosis.  These 
systems  were  designed  to  be  consistent  with  well-understood 
formalisms  for  reasoning. 

Both  the  descriptive  and  axiomatic  approaches  have  led  to 
the  construction  of  systems  that  perform  at  levels  rivaling 
experts  in  a  variety  of  domains.  Given  the  complexity  of 
problems  at  hand  and  the  youth  of  the  field,  both 
approaches  have  been  useful  in  exploring  techniques  for 
automated reasoning. In general there has been a healthy
interplay between the descriptive and the axiomatic
research: a dynamic research milieu is created by the
coexisting approaches.


3. THE BENEFITS OF FORMALIZATION

A worthy fundamental goal of research should be the
eventual development of useful theories. As in any science,
the study of automated reasoning would benefit greatly
from attempts to construct theories for representing and
manipulating knowledge. Whether an investigator initially
chooses  to  become  involved  with  descriptive  or  formal 
research,  a  fundamental  goal  should  be  the  construction  of  a 
formal  science.  A  strong  theoretical  basis  for  components 
of  expert  reasoning  systems  would  be  extremely  useful. 
While  there  have  already  been  strides  in  the  application  of 
formal  theories  to  expert  systems,  greater  understanding 
could  facilitate  the  design,  control,  and  characterization  of 
expert  systems. 

The  subscription  to  axiomatic  bases  for  components  of 
expert  reasoning  can  be  useful  in  a  number  of  ways.  It  can 
assure  a  system  engineer  that  the  behavior  of  his  system 
will  remain  consistent  with  a  set  of  desired  properties. 
Basing  a  system  on  a  formal  theory  also  ensures  that  the 
system  will  be  self-consistent.  If  an  axiomatic  theory  is  not 
used  in  building  an  expert  system,  it  can  be  quite  difficult 
to  maintain  self-consistency.  The  presence  of 
inconsistencies  in  complex  computer  systems  often  leads  to 
unpredictable  behavior. 

Recent  research  on  the  ad  hoc  certainty  factor  model 
used  for  combining  evidence  in  the  MYCIN  system 
introduced above has found the original model to be
self-inconsistent [16, 18]. Recent work has focused on
removing  inconsistencies  in  the  model  [16].  The  consistent 
reformulation  of  certainty  factors  demonstrates  that  the 
belief  revision  theory  is  a  specialization  of  probability  in 
that  assumptions  of  conditional  independence  are  imposed 
by  the  methodology.  For  example,  it  can  be  shown  that 
evidence  must  be  conditionally  independent  given  H  and  its 
negation  [16].  The  determination  of  inconsistency  and  the 
detection  of  constraints  were  facilitated  by  the 
formalization  of  MYCIN's  reasoning  strategies. 

Formal  models  can  also  assist  an  engineer  greatly  when  a 
system  is  modified.  A  formal  system  allows  for  the  crisp 
prediction  of  changes  in  system  behavior  in  response  to 
system  modifications.  It  can  be  quite  difficult  to  predict 
the  impact  of  modifications  on  systems  for  which  no 
underlying  theoretical  structure  is  available.  Having  the 
ability  to  control  the  effect  of  system  modifications  is 
extremely  important  for  the  maintenance  of  systems,  for  the 
generalization  of  specific  successes,  and  for  the  incremental 
refinement  of  techniques.  Incremental  refinement  can  be 
particularly  significant  in  the  continuing  development  of  a 
theoretical  framework  for  automated  reasoning. 

Most  relevant  for  this  conference,  formalization  can  also 
be  crucial  for  expert  systems  research  to  benefit  from  the 
participation of investigators in other highly developed
disciplines.  Issues  surrounding  descriptive  and  axiomatic 
expert  systems  research  are  of  special  relevance  in  this 
regard.  For  example,  expert  systems  research  would  benefit 
if  it  could  attract  statisticians  to  assist  in  solving  difficult 
problems.  Formal  descriptions  of  systems  and 
methodologies  are  important  as  they  provide  conceptual 
handles  necessary  for  communication  with  researchers  in 
other  fields. 


4. PROBLEMS WITH THE FORMAL APPROACH


Two  central  issues  that  arise  in  discussions  of  the 
axiomatic  approach  are  problems  regarding  the  pragmatics 
of engineering and computation, as well as explanation.

4.1 Tractability of Engineering and Computation

More so than for any other reason, researchers in
artificial intelligence have looked beyond axiomatic-based
techniques for complex domains because of the
computational overhead of inference and the requirement
for large amounts of knowledge. Formal methodologies are
viewed as having an insatiable thirst for data and computer
processing [8, 34].


4.2  Explanation 

Another  significant  problem  cited  with  respect  to  formal 
methodologies  is  that  it  is  difficult  to  explain 
recommendations  to  users.  The  explanation  of  expert 
systems has been identified as a significant factor in the
acceptance  of  expert  systems  [35].  In  fact,  the  transparency 
of  reasoning  has  been  cited  as  a  fundamental  feature  of 
expert  systems,  distinguishing  them  from  numerical 
programs  and  other  kinds  of  reasoning  systems  in  artificial 
intelligence [3]. The important role of reasoning
transparency in expert systems has made explanation an
artificial intelligence research focus.

It has been said that formal methodologies like
probability theory and decision analysis lead to unavoidable
losses in comprehensibility to expert system users [8, 34].
The manipulation of the equations of conditional
probability or decision trees may indeed be quite difficult
to succinctly explain. Such difficulties have provoked some
of the ongoing work on techniques for justifying the results
of formal reasoning strategies [33, 27, 20]. We shall focus
more closely on this problem below.

5.  GRACEFUL  DEGRADATION  OF  PERFORMANCE 

The  concerns  about  problems  with  explanation,  knowledge 
acquisition  and  computational  tractability  of  systems  based 
on  formalisms  for  reasoning  under  uncertainty  are  valid. 
Indeed  the  methodologies  demand  large  amounts  of  data  and 
computation.  Complaints  about  the  opacity  of  explanations 
of  recommendations  are  also  justified. 

Formal  methodologies  for  reasoning  under  uncertainty 
have  been  put  forth  as  general  theories.  They  have  not 
been  designed  for  use  in  complex  reasoning  systems  that 
might  be  dominated  by  limitations  in  computational  and 
engineering  resources.  An  interesting  and  potentially 
fruitful  area  for  investigation  is  the  development  of 
strategies  for  modifying  formal  methodologies  to  perform 
under  specified  constraints.  The  process  of  identifying 
pressing  resource  limitations  followed  by  an  attempt  to 
reformulate  theories  (deemed  optimal  in  a  world  with 
infinite  resources)  to  perform  in  constrained  environments 
could  be  more  useful  than  the  outright  dismissal  of  the 
theories. Such techniques could allow an engineer to
gracefully degrade a system's performance to reflect
diminishing amounts of available engineering or
computational resources.

Theories  of  belief  revision  and  information  acquisition 
have  not  traditionally  been  accompanied  by  tools  that  allow 
a  well-defined  relaxation  of  restrictions  or  requirements.  It 
would  be  productive  to  develop  such  methodologies  to 
generate  well-characterized  trade-offs  such  as  between  the 
accuracy  of  a  recommendation  and  computation  time. 
Useful  approaches  to  graceful  degradation  of  various  aspects 
of  reasoning  behavior  would  make  the  disagreement  with 
properties  of  general  parent  theories  clear.  The 
development  of  strategies  for  the  controlled  degradation  of 
reasoning  would  allow  artificial  intelligence  researchers  to 
continue  to  build  upon  the  theoretical  achievements  of 
more  mature  disciplines. 

We  will  now  turn  to  an  example  of  the  degradation  of 
expert  system  performance  to  satisfy  constraints  on  the 
complexity of inference. As we shall see, degrading an
optimal reasoning methodology can serve to enhance the
explanation  capability  in  an  expert  system. 

6.  EXPLAINING  COMPLEX  REASONING 

I  would  like  to  demonstrate  an  example  of  the 
decomposition of a complex reasoning methodology. I hope
that it may serve as an example of a category of strategies
that can help investigators successfully apply axiomatic
models. First I will present an information-optimizing
reasoning  strategy  that  makes  inferences  that  are  difficult  to 
explain.  I  will  then  describe  how  a  less  efficient  but  more 
explainable  strategy  could  be  generated. 

6.1  The  Complexity  of  Reasoning  Under  Uncertainty 

We  have  proposed  [19]  that  a  central  aspect  of  the 
difficulty  that  investigators  have  had  in  explaining  expert 
system  recommendations  is  based  on  the  intrinsic 
complexity  of  formal  reasoning  under  uncertainty.  As 
often  noted,  a  fundamental  difference  between  simple 
deduction  and  more  general  reasoning  under  uncertainty  is 
the  inference  complexity:  within  a  deductive  system,  any 
particular  path  to  a  conclusion  is  considered  to  be  a 
sufficient  proof;  in  contrast,  reasoning  under  uncertainty 
usually entails the consideration of all paths [5]. Formal
theories  of  belief  revision  and  information  acquisition 
generally  involve  the  parallel  consideration  of  a  greater 
number  of  propositions  than  simple  logical  deduction 
problems.  For  example,  probabilistic  reasoning  systems 
calculate  the  values  of  single  conditional  probabilities  to 
summarize  many  steps  of  inference.  This  complex 
summarization  process,  so  central  in  probabilistic  inference, 
has  been  seen  as  a  problem  in  expert  system 
understandability [8].

What  is  the  fundamental  basis  for  problems  with 
complexity?  Cognitive  psychology  results  can  lend  insight 
to  this  question.  Problems  associated  with  the 
comprehension  of  complex  problems  such  as  the  operation 
of  complex  reasoning  strategies  have  been  a  longtime 
research focus within cognitive psychology [2]. Classic
research in this field has demonstrated severe limitations in
the ability of humans to consider more than a handful of
concepts in the short term [21]. In fact, studies [38] have
discovered  that  humans  cannot  retain  and  reason  about 
more  than  two  concepts  in  an  environment  with 
distractions.  Such  results  underscore  the  need  for  managing 
the  complexity  of  expert  systems  inference. 

For  humans  to  successfully  understand,  plan,  prove,  and 
design  in  environments  that  are  informationally  complex, 
they  must  devise  schemes  for  decomposing  large  unwieldy 
problems  into  smaller,  interrelated  sub-problems.  I  will 
present our work on the enhancement of explanation
through the decomposition of complex formal reasoning.
Before presenting the work, I must first describe the
hypothetico-deductive architecture of PATHFINDER.

7.  THE  PATHFINDER  PROJECT 

PATHFINDER [17] is a hypothetico-deductive expert
system  for  the  diagnosis  of  lymph  node  pathology  based 
upon  the  appearance  of  microscopic  features  in  lymph  node 
tissue.  Disease  manifestations  in  lymph  node  pathology  are 
microscopic  features.  Features  are  each  subdivided  into  a 
mutually  exclusive  and  exhaustive  list  of  values.  Features 
are  evaluated  by  the  selection  of  a  value  that  reflects  the 
status  of  the  feature  in  the  case  being  reviewed.  We  say 
that  the  assignment  of  a  value  to  a  feature  constitutes  a 
piece  of  evidence.  The  PATHFINDER  system  reasons 
about  80  diseases,  considering  over  500  pieces  of  evidence. 

7.1 The Hypothetico-Deductive Architecture

The PATHFINDER system is based on the hypothetico-deductive
architecture. The hypothetico-deductive method
(also referred to as the method of sequential diagnosis
[14]) has been studied in several expert systems research
projects  including  the  Acute  Renal  Failure  [15]  system,  the 
INTERNIST-1  [22]  system  for  diagnosis  within  the  field 
of  internal  medicine,  and  the  MEDAS  [1]  system  for 
emergency  medicine. 

Hypothetico-deductive systems are presented with an
initial set of evidence. The initial evidence is used to
assign a probabilistic or quasi-probabilistic score to each
hypothesis  and  a  list  of  plausible  hypotheses  is  formulated 
from  the  scores.  Then,  questions  are  selected  which  can 
help  decrease  the  number  of  hypotheses  under  consideration. 
After  a  user  replies  to  requests  for  new  information,  a  new 
set  of  hypotheses  is  formulated  and  the  entire  process  is 
repeated  until  a  single  diagnosis  is  reached. 
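
The cycle just described can be sketched in code. The following is a minimal illustration rather than the procedure of any of the systems named above: it assumes a single disease is present, that scores are updated by Bayes' rule, and that hypotheses are dropped once their probability falls below a threshold. The names `bayes_update`, `sequential_diagnosis`, `ask`, `select`, and `threshold` are all hypothetical.

```python
from typing import Callable, Dict, List

# likelihoods[disease][feature][value] = p(value | disease); illustrative layout
Likelihoods = Dict[str, Dict[str, Dict[str, float]]]


def bayes_update(beliefs: Dict[str, float], likelihoods: Likelihoods,
                 feature: str, value: str) -> Dict[str, float]:
    """Revise the score of each hypothesis after one piece of evidence."""
    unnormalized = {d: p * likelihoods[d][feature][value]
                    for d, p in beliefs.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {d: p / total for d, p in unnormalized.items()}


def sequential_diagnosis(priors: Dict[str, float], likelihoods: Likelihoods,
                         features: List[str],
                         ask: Callable[[str], str],
                         select: Callable[[Dict[str, float], List[str]], str],
                         threshold: float = 0.01) -> Dict[str, float]:
    """Repeat: select a question, obtain the answer, rescore, prune the list."""
    beliefs = dict(priors)
    remaining = list(features)
    while len(beliefs) > 1 and remaining:
        feature = select(beliefs, remaining)   # hypothesis-directed choice
        remaining.remove(feature)
        beliefs = bayes_update(beliefs, likelihoods, feature, ask(feature))
        plausible = {d: p for d, p in beliefs.items() if p >= threshold}
        if not plausible:                      # always keep the best hypothesis
            best = max(beliefs, key=beliefs.get)
            plausible = {best: beliefs[best]}
        total = sum(plausible.values())
        beliefs = {d: p / total for d, p in plausible.items()}
    return beliefs
```

The `select` argument is where a question-selection strategy plugs in; the loop itself is indifferent to how the next question is chosen.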

The  question  selection  strategies  are  termed  hypothesis- 
directed  in  that  reasoning  strategies  operate  on  the  current 
list  of  hypotheses  under  consideration  to  generate 
recommendations  for  additional  evidence  gathering. 
Investigators  in  the  INTERNIST-1  and  PATHFINDER 
research  groups  have  explored  the  usefulness  of  tailoring 
different  reasoning  strategies  to  the  current  list  of  diseases 
under  consideration  or  differential  diagnosis.  For  example, 
the  strategy  selected  to  narrow  the  differential  diagnosis 
may  depend  upon  the  number  of  diseases  on  the 
differential,  the  probability  distribution  over  the 
differential,  or  both. 

The  advice  generated  by  hypothesis-directed  strategies  is 
often  difficult  to  explain  because  of  the  complexity  of  their 
operation.  This  is  especially  true  if  recommendations  are 
the  result  of  inferences  based  on  a  large  hypothesis  list. 
Hypothesis-directed  strategies  may  consider  the  relevance  of 
hundreds  of  hypotheses  in  a  single  inference  step. 

The  scoring  scheme  employed  by  PATHFINDER  is  based 
upon the theory of subjective probability [9]. The
subjective  probabilities  of  experts  are  used  to  infer  the 
probability  that  each  disease  is  responsible  for  the  evidence 
that  has  been  entered  into  the  system.  Depending  on  the 
number  and  the  distribution  of  probabilities  among  diseases 
on  the  differential  diagnosis,  PATHFINDER  chooses  one  of 
several  alternative  diagnostic  strategies  for  selecting 
questions.  As  in  other  hypothesis-directed  systems,  it  is  the 
goal  of  the  question  selection  strategies  to  suggest  the 
optimal  test  to  be  evaluated  next  in  an  effort  to  reduce  the 
uncertainty  in  the  differential  diagnosis. 

Several  PATHFINDER  strategies  discriminate  among 
large  numbers  of  diseases  and  features  in  the  generation  of 
advice.  I  shall  not  describe  all  of  the  hypothesis-directed 
reasoning  strategies  used  by  PATHFINDER.  Rather,  we 
will  look  at  issues  surrounding  the  explanation  of  a 
particular  PATHFINDER  hypothesis-directed  reasoning 
strategy  termed  entropy-discriminate  and  its  descendant, 
group-discriminate. 

7.2  A  Strategy  to  Minimize  Uncertainty 

The  PATHFINDER  entropy-discriminate  reasoning 
strategy  was  originally  used  to  refine  differential  diagnosis 
disease  lists  ranging  in  size  from  two  to  eighty  diseases.  The 
strategy  makes  recommendations  about  information 
acquisition  by  searching  for  tests  that  maximize  a  measure 
of  information  contained  in  the  differential  diagnosis. 
Similar  information-maximizing  strategies  have  been 
examined  in  the  MEDAS  and  Acute  Renal  Failure  systems. 

Entropy-discriminate  makes  use  of  a  measure  of 
information  known  as  relative-entropy.  In  this  context, 
relative  entropy  is  a  measure  of  the  additional  information 
provided by a piece of evidence $E_i$ about a differential
diagnosis DD. Formally,

$$H(DD, E_i) = \sum_j p(D_j \mid E_i) \log\left[\frac{p(D_j \mid E_i)}{p(D_j)}\right],$$

where $p(D_j)$ is the probability that disease $D_j$ is present
before evidence $E_i$ is known, the prior probability of the
disease, and $p(D_j \mid E_i)$ is the probability that disease $D_j$ is
present after evidence $E_i$ is known, the posterior probability
of the disease. For a justification of relative entropy as a
measure of information gain, see [29].


As each feature consists of a set of mutually exclusive
and exhaustive values, we can denote the possible evidence
associated with a particular feature, F, as $E_1, \ldots, E_n$, where n is
the number of mutually exclusive values associated with the
feature. Entropy-discriminate selects features which give
the highest expected relative entropy

$$\langle H(DD, F) \rangle = \sum_{i=1}^{n} p(E_i)\, H(DD, E_i),$$

where the quantity is summed over feature values $E_1, \ldots, E_n$,
and $p(E_i)$ is calculated using the expansion rule

$$p(E_i) = \sum_j p(E_i \mid D_j)\, p(D_j).$$

In  an  information-theoretic  sense,  the  questions  selected 
by  the  entropy-discriminate  strategy  are  optimal  assuming 
that  the  goal  of  the  pathologist  is  to  reduce  uncertainty  in 
the  differential  as  much  as  possible. 
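
The selection rule above can be sketched as follows. This is an illustrative reading rather than PATHFINDER's implementation: it assumes the posterior-to-prior form of relative entropy (which makes the quantity non-negative), natural logarithms, and a table layout `feature_tables[feature][disease][value] = p(value | disease)`. All function and variable names are hypothetical.

```python
import math
from typing import Dict, Tuple

Table = Dict[str, Dict[str, float]]   # disease -> value -> p(value | disease)


def posterior(priors: Dict[str, float], table: Table,
              value: str) -> Tuple[float, Dict[str, float]]:
    """Return p(E_i) by the expansion rule and p(D_j | E_i) by Bayes' rule."""
    joint = {d: priors[d] * table[d][value] for d in priors}
    p_e = sum(joint.values())
    if p_e == 0:
        return 0.0, {}
    return p_e, {d: joint[d] / p_e for d in joint}


def relative_entropy(post: Dict[str, float], prior: Dict[str, float]) -> float:
    """Sum of p(D|E) * log[p(D|E) / p(D)]; zero-probability terms contribute 0."""
    return sum(q * math.log(q / prior[d]) for d, q in post.items() if q > 0)


def expected_relative_entropy(priors: Dict[str, float], table: Table) -> float:
    """Average the relative entropy over the feature's possible values."""
    values = next(iter(table.values())).keys()
    total = 0.0
    for value in values:
        p_e, post = posterior(priors, table, value)
        if p_e > 0:
            total += p_e * relative_entropy(post, priors)
    return total


def entropy_discriminate(priors: Dict[str, float],
                         feature_tables: Dict[str, Table]) -> str:
    """Pick the feature with the highest expected relative entropy."""
    return max(feature_tables,
               key=lambda f: expected_relative_entropy(priors, feature_tables[f]))
```

Note that every candidate feature is scored against every disease on the differential, which is the source of both the strategy's power and its cost.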

7.3  Problems  With  the  Optimal  Strategy 

Soon  after  the  implementation  of  entropy-discriminate 
mode,  we  discovered  that  several  expert  pathologists, 
including  the  expert  that  provided  the  system's  knowledge, 
often  found  that  selected  questions  were  difficult  to 
understand  when  the  differential  contained  more  than 
approximately ten diseases. The entropy-discriminate
strategy  of  selecting  questions  that  best  discriminate  among 
all  diseases  on  a  differential  diagnosis  often  seemed  to  be 
too  complex  for  experts.  This  is  not  surprising  in  light  of 
the  limitations  of  human  short  term  memory  discussed 
above. 

We  also  had  problems  explaining  the  recommendations  of 
entropy-discriminate  whenever  there  were  more  than  two 
diseases  on  the  differential.  Attempts  were  made  to  provide 
textual  and  graphical  explanations  for  the  powerful 
strategy's  recommendations.  One  such  graphical  explanation 
justified  questions  by  listing,  for  each  disease,  the  feature 
value  that  would  most  favor  the  disease.  Physicians  found 
such  complex  summarizations  to  be  difficult  to  understand. 

7.4  The  Graceful  Decomposition  of  Diagnostic  Problem 
Solving 

The  observed  problems  with  the  entropy-discriminate 
strategy  stimulated  our  interest  in  strategies  for  simplifying 
and  explaining  hypothesis-directed  reasoning.  We 
discovered  that  pathologists  often  manage  the  complexity  of 
the  diagnostic  problem-solving  task  by  reasoning  about  a 
very  small  number  of  disease  categories  or  groups  at  any 
one  time.  Questions  that  discriminate  among  natural  groups 
tend  to  be  proposed. 

Specifically,  the  chief  expert  pathologist  on  the 
PATHFINDER  team  often  imposes  a  simple  two-group 
discrimination  structure  on  the  problem-solving  task.  As 
opposed  to  a  strategy  of  discriminating  among  all  the 
diseases  on  the  differential,  the  pathologist's  discrimination 
task  at  any  point  in  reasoning  about  a  case  is  constrained  to 
only  two  groups  of  diseases.  As  categories  of  diseases  are 
ruled  out,  the  particular  pairs  of  groups  considered  become 
increasingly  specific.  For  example,  if  there  are  benign  and 
malignant  diseases  on  a  differential  diagnosis,  the  pathology 
expert  often  deems  most  appropriate  those  questions  that 
best  discriminate  between  the  benign  and  malignant  groups 
rather  than  questions  that  might  best  discriminate  among  all 
of  the  diseases.  If  all  benign  diseases  have  been  ruled  out, 
leaving  only  primary  malignancies  and  metastatic  diseases 
on  the  differential  diagnosis,  the  pathologist  will  attempt  to 
discriminate  between  the  primary  malignancy  and  the 
metastatic  categories. 

We  found  that  the  expert's  diagnostic  strategy  can  be 
described  by  the  traversal  of  a  hierarchy  of  disease 
categories. The problem-solving hierarchy (see Fig. 1) is a
binary tree of disease groups. The hierarchy can be used to
group the differential diagnosis at various levels of
refinement.

It  is  interesting  to  note  that  several  previous  studies  of 
medical  reasoning  have  identified  similar  problem-solving 
hierarchies  [10,  11,  12]  for  managing  the  complexity  of  a 
wide variety of reasoning tasks.

The  discovery  of  this  expert  reasoning  strategy  in  lymph 
node  pathology  suggested  the  development  of  a  new 
question-selection  strategy  that  could  discriminate  among 
binary  groups  of  diseases  instead  of  individual  diseases.  It 
was  hoped  that  design  and  application  of  such  a  strategy 
would  make  explanation  clear,  as  the  user  would  only  have 
to  consider  the  relevance  of  a  recommendation  to  two 
groups. 

Our  attempt  to  naturally  constrain  the  discriminatory 
focus  of  the  entropy-discriminate  strategy  led  to  a  new 
reasoning  strategy  we  named  group-discriminate.  The 
group-discriminate  strategy  selects  questions  based  on  their 
ability  to  discriminate  between  the  most  specific  pair  of 
disease  categories  that  account  for  all  diseases  on  the 
differential. 
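
One way to find "the most specific pair of disease categories that account for all diseases on the differential" is to descend the binary hierarchy until the differential straddles both children of a node. The sketch below assumes that reading. The tree it builds is a small hypothetical stand-in based on the benign/malignant and primary/metastatic examples in the text, not the actual PATHFINDER hierarchy, and the names `Category` and `most_specific_pair` are illustrative.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Category:
    name: str
    diseases: FrozenSet[str]
    left: Optional["Category"] = None
    right: Optional["Category"] = None


def most_specific_pair(root: Category, differential):
    """Descend while the whole differential falls under a single child.

    Returns ((left name, diseases), (right name, diseases)) for the deepest
    node whose two children both contain diseases on the differential, or
    None when the differential lies within a single leaf category.
    """
    node = root
    while node.left is not None and node.right is not None:
        in_left = differential & node.left.diseases
        in_right = differential & node.right.diseases
        if in_left and in_right:
            return (node.left.name, in_left), (node.right.name, in_right)
        node = node.left if in_left else node.right
    return None


# Hypothetical three-leaf hierarchy for illustration only.
PRIMARY = Category("primary malignancy", frozenset({"follicular lymphoma"}))
METASTATIC = Category("metastatic", frozenset({"metastatic carcinoma"}))
MALIGNANT = Category("malignant", PRIMARY.diseases | METASTATIC.diseases,
                     PRIMARY, METASTATIC)
BENIGN = Category("benign", frozenset({"reactive hyperplasia"}))
ROOT = Category("all", BENIGN.diseases | MALIGNANT.diseases, BENIGN, MALIGNANT)
```

As benign diseases are ruled out, the same call returns the primary/metastatic pair, mirroring the increasingly specific groupings described above.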

Figure 1: Heuristic problem-solving hierarchy

For a given differential diagnosis, group-discriminate
identifies the most specific grouping possible and then
selects questions that best discriminate among groups of
diseases. More formally, suppose the differential is split
into two groups, $G_1$ and $G_2$, of $n_1$ and $n_2$ diseases
respectively:

$$G_1 = \{D_{11}, D_{12}, \ldots, D_{1 n_1}\}$$

$$G_2 = \{D_{21}, D_{22}, \ldots, D_{2 n_2}\}$$

As  we  assume  that  only  one  lymph  node  disease  is  present 
in  PATHFINDER,  we  can  consider  the  diseases  to  be 
mutually  exclusive  events.  We  are  interested  in  the 
probability  that  the  true  diagnosis  will  be  in  each  group.  To 
calculate  this  probability  we  add  the  probabilities  of  all  the 
diseases  within  each  group.  That  is,  the  probability  that  a 
group  contains  the  true  diagnosis  is 

$$p(G_j) = \sum_{k=1}^{n_j} p(D_{jk}), \quad j = 1, 2.$$

We can also calculate $p(G_j \mid E_i)$, the probability of the final
diagnosis being contained in a group, considering a new
piece of evidence $E_i$. This is

$$p(G_j \mid E_i) = \sum_{k=1}^{n_j} p(D_{jk} \mid E_i), \quad j = 1, 2.$$

Therefore, a relative entropy of the grouped differential can
be defined. In particular,

$$H_G(DD, E_i) = \sum_{j=1}^{2} p(G_j \mid E_i) \log\left[\frac{p(G_j \mid E_i)}{p(G_j)}\right].$$

This quantity represents the additional information
contained in $E_i$ about the grouped differential diagnosis.
Group-discriminate  selects  those  features  which  give  the 
highest  expected  relative  entropy. 

Notice  that  the  group-discriminate  strategy  ignores 
information  concerning  the  probabilities  of  diseases  within 
each group. Only the probabilities that the true diagnosis
lies within a group are considered in the calculations.
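
A short sketch may make the last point concrete. It computes the grouped relative entropy from prior and posterior disease probabilities, assuming mutually exclusive diseases and natural logarithms; the function names and the four-disease example in the usage note are hypothetical.

```python
import math
from typing import Dict, Iterable, Set


def group_probability(dist: Dict[str, float], group: Set[str]) -> float:
    """Probability that the true diagnosis lies in the group."""
    return sum(dist[d] for d in group)


def grouped_relative_entropy(prior: Dict[str, float], post: Dict[str, float],
                             groups: Iterable[Set[str]]) -> float:
    """Sum over groups of p(G|E) * log[p(G|E) / p(G)]."""
    total = 0.0
    for group in groups:
        p_prior = group_probability(prior, group)
        p_post = group_probability(post, group)
        if p_post > 0 and p_prior > 0:   # empty terms contribute nothing
            total += p_post * math.log(p_post / p_prior)
    return total
```

With four equally likely diseases split into groups {a, b} and {c, d}, evidence that shifts all belief onto a and c leaves both group probabilities at 0.5, so the grouped measure is zero even though the individual disease probabilities changed substantially.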

8.  DISCUSSION 

We  integrated  the  group-discriminate  strategy  into  the 
PATHFINDER  system  so  that  it  continues  to  refine 
differential  diagnosis  lists  until  all  diseases  remaining  on 
the  differential  diagnosis  are  in  a  category  at  one  of  the 
leaves  of  the  binary  problem-solving  tree.  At  this  point, 
other  hypothesis-directed  strategies  are  applied  to  continue 
pursuing  a  diagnosis.  As  the  group-discriminate  reasoning 
strategy  has  a  simpler  discriminatory  focus  and  more  closely 
follows  the  decision  making  protocol  of  the  expert  lymph 
node  pathologist  than  entropy-discriminate,  it  is  quite  easy 
to  explain. 

Instead  of  having  to  present  complex  summaries 
explaining  how  each  piece  of  evidence  might  impact  on 
belief  in  the  presence  of  a  number  of  diseases,  an 
explanation  of  questions  generated  by  group-discriminate 
must  simply  demonstrate  how  possible  responses  affect  the 
two  groups  under  consideration. 

The  PATHFINDER  system  justifies  the  usefulness  of 
questions  selected  by  group-discriminate  with  a  graphical 
display.  Fig.  2  presents  a  small  portion  of  a  PATHFINDER 
consultation.  At  the  top  of  the  figure  is  the  differential 
diagnosis,  grouped  into  benign  and  malignant  categories  (at 
the  current  level  of  refinement).  Below,  several  lymph  node 
features  recommended  by  group-discriminate  are  listed. 
The  group-discriminate  strategy  has  determined  that  these 
features  can  best  discriminate  between  the  benign  and 
malignant  diseases.  In  this  case,  the  user  requested 
explanation  for  the  follicles  density  recommendation. 

The  positions  of  a  set  of  asterisks  in  the  justification 
graph  at  the  bottom  of  the  figure  are  used  to  indicate  the 
degree  to  which  each  group  of  diseases  is  favored  by  each 
possible  feature  value.  Specifically,  the  position  of  an 
asterisk  is  a  function  of  the  likelihood  ratio 
$p(E_i \mid G_1)/p(E_i \mid G_2)$. In the example, the values "separated"
and "far apart" strongly support diseases on the differential
diagnosis that are in the benign group, while the values
"back-to-back" and "closely packed" strongly support the
malignant disease hypotheses.
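
The likelihood ratio behind each asterisk can be sketched as below. The paper does not state how the group-conditional likelihoods are obtained; the sketch assumes mutually exclusive diseases, so that the likelihood of a value given a group is the prior-weighted average of the per-disease likelihoods. The function names and the numbers in the usage note are hypothetical.

```python
import math
from typing import Dict, Set


def group_likelihood(priors: Dict[str, float],
                     likelihoods: Dict[str, Dict[str, float]],
                     group: Set[str], value: str) -> float:
    """p(E | G) = sum over D in G of p(E | D) p(D), divided by p(G)."""
    p_group = sum(priors[d] for d in group)
    if p_group == 0:
        raise ValueError("group has zero prior probability")
    return sum(likelihoods[d][value] * priors[d] for d in group) / p_group


def likelihood_ratio(priors: Dict[str, float],
                     likelihoods: Dict[str, Dict[str, float]],
                     g1: Set[str], g2: Set[str], value: str) -> float:
    """p(E | G1) / p(E | G2); infinite when the value is impossible under G2."""
    denominator = group_likelihood(priors, likelihoods, g2, value)
    if denominator == 0:
        return math.inf
    return group_likelihood(priors, likelihoods, g1, value) / denominator
```

For example, with two malignant diseases of prior 0.25 each assigning probabilities 0.8 and 0.4 to a value, and one benign disease of prior 0.5 assigning it 0.1, the malignant-to-benign ratio is about 6, so the value would be plotted well toward the malignant side.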

A user can easily ascertain how a question discriminates
between two groups of diseases; evidence is either supportive
for  one  group  or  the  other.  Even  in  an  environment  filled 
with  distractions,  the  behavior  of  the  strategy  is  adequately 
explained  by  such  simple  graphs. 

Discriminating:

Malignant

    Small cleaved, follicular lymphoma
    Mixed, small cleaved and large cell, follicular lymphoma
    Large cell, follicular lymphoma
    Kaposi's sarcoma
    Small noncleaved, follicular lymphoma

Benign

    Florid reactive follicular hyperplasia
    Reactive hyperplasia
    AIDS

I recommend that the following features be evaluated:

    Status of follicles
    Follicles density
    Subcapsular sinuses
    Medullary sinuses
    Comparison of cytology inside and outside the follicles

> Justify

Which feature do you want justified?

> follicles density

The following table elucidates the discriminating power of
this feature. The position of the asterisk indicates which
of the two groups of diseases is favored by each value.

    Malignant      Benign
        |            |
        v            v
        *.............    back-to-back
        *.............    closely packed
        ............*.    separated
        .............*    far apart

Figure 2: PATHFINDER consultation

Unfortunately, the more explainable group reasoning
strategy has some disadvantages. A predictable problem
with the use of group-discriminate is that the differential
diagnosis refinement process does not always proceed as
quickly as it does with the application of the optimal
entropy-discriminate. That is, group-discriminate is not as
efficient as the more powerful entropy-discriminate; on
average, a larger number of evidence-gathering requests will
be made by group-discriminate to achieve a similarly
refined differential diagnosis. This must be the case as
detailed information about the plausibility of individual
diseases within each group is discarded in the grouping
process.
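
The loss of information through grouping can be made explicit with a short derivation. What follows is a sketch that assumes the posterior-to-prior form of relative entropy and relies on the standard log-sum inequality; it is offered as an illustration and is not part of the original analysis. For each group $G_j$,

$$\sum_{k=1}^{n_j} p(D_{jk} \mid E_i) \log\frac{p(D_{jk} \mid E_i)}{p(D_{jk})} \;\ge\; p(G_j \mid E_i) \log\frac{p(G_j \mid E_i)}{p(G_j)}, \quad j = 1, 2.$$

Summing over the two groups gives

$$H(DD, E_i) \ge H_G(DD, E_i),$$

and weighting each side by $p(E_i)$ and summing over the values of a feature F gives

$$\langle H(DD, F) \rangle \ge \langle H_G(DD, F) \rangle.$$

Under these assumptions, no question is expected to carry more information about the grouped differential than about the full differential. Equality holds when the ratio of posterior to prior probability is the same for every disease within a group, that is, when the evidence does not discriminate within groups.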

In  general,  simplification  of  an  optimal  strategy  will  lead 
to  a  less-efficient  strategy.  Also,  given  the  limits  of  human 
cognition  identified  by  research  in  cognitive  psychology,  it 
is  not  unexpected  that  a  reasoning  strategy  derived  through 
the constraint or decomposition of a complex
problem-solving task may be easier to understand and explain. It
seems  that  for  a  wide  variety  of  reasoning  strategies,  there 
will  frequently  be  an  inverse  relationship  between  reasoning 
understandability  and  efficiency.  In  making  decisions  about 
alternative  reasoning  strategies  and  the  clarity  of 
explanation  for  expert  systems,  computer  scientists  may  be 
able  to  make  use  of  a  well-characterized 
explainability/efficiency  trade-off. 


9.  CONCLUSION 

I  discussed  the  usefulness  of  automated  reasoning 
methodologies  that  follow  from  desired  fundamental 
properties  and  presented  an  example  of  the  application  of  a 
strategy  that  gracefully  degrades  complex  reasoning  of  an 
expert system. The degradation was based on the
decomposition  of  the  diagnostic  task.  The  degradation 
strategy  enabled  the  system  to  generate  transparent 
justifications  for  its  requests  for  information,  in  exchange 
for  a  reduction  in  the  optimality  of  its  recommendations. 


I  believe  that  continuing  research  on  the  pragmatics  of 
applying  formal  models  in  the  face  of  severe  limitations  in 
data and computation, as well as in the abilities of system
users, will be beneficial. The development and refinement
of  methodologies  for  the  controlled  degradation  of 
reasoning  will  allow  artificial  intelligence  researchers  to 
build  upon  the  elegant  achievements  of  other  disciplines. 

Computer Applications of Bayesian Statistics in Medicine

Holly  Jimison,  Stanford  University 


ABSTRACT 

The Bayesian statistical methodology is an
especially important technique for computer-assisted
medical decision making because
there is often a shortage of data directly
related to a given clinical decision or
classification problem. Also, many clinical
relationships are noisy and weakly correlated.
Under these conditions a priori information has
significant influence on the ultimate
classification rule. Bayesian statistics
incorporates a priori knowledge and
conveniently  handles  variations  in  costs  of 
misclassifications  with  a  loss  function.  This 
paper  shows  how  Bayesian  analysis  is 
appropriate  for  medical  decision  making, 
reviews  problems  seen  with  such  systems,  and 
provides  suggestions  and  examples  of  how 
some  systems  have  addressed  these  problems. 

VALUE  OF  A  BAYESIAN  APPROACH 

Medicine  is  a  very  difficult  domain  for 
decision  making,  and  an  especially  challenging 
area  in  which  to  try  and  automate  this  process. 
Medical  decisions  are  characterized  as  being 
important,  in  that  the  utilities  of  the  possible 
outcomes  can  be  dramatically  different.  Also, 
the decisions typically need to be made fairly
quickly,  but  with  incomplete  and  noisy  data.  A 
priori  information  becomes  an  important  part  of 
a  model  in  such  a  situation,  and  Bayesian 
analysis  provides  for  explicit  representation  of 
both  a  priori  information  and  utilities.  As  one 
considers  problems  from  areas  on  the 
continuum  going  from  physics  and  engineering 
to  biology,  physiology,  and  clinical  medicine, 
the  appropriate  models  for  systems  in  these 
areas become less deterministic and more
ill-structured. For most clinical situations, there
are an intractable number of confounding
factors that may affect a particular variable of
interest, in ways that are highly situation-dependent.
This leads to weak models with
weak correlations. Accurate clinical models
often  need  to  make  use  of  associational 
relationships  more  than  causal  or  mechanistic 
relationships.  This  shift  parallels  the  clinical 
reasoning  used  by  physicians  as  they  gain 
clinical  experience.  Medical  students  naturally 
have  a  tendency  to  rely  on  "textbook" 
knowledge,  which  is  mainly  causal  and 
mechanistic.  As  clinical  experience  is  gained, 
more  use  is  made  of  associational  knowledge, 
and  much  of  the  reasoning  for  diagnosis  and 
treatment  seems  to  be  pattern-matching. 
However,  causal  knowledge  is  still  relied  on 
for  explanation  and  for  reasoning  about  new 
situations.  A  statistical  model  is  very 
appropriate  for  the  pattern-matching  reasoning. 
A Bayesian statistical model is especially
appropriate for many medical domains where
the  relationships  between  variables  are  weak 
and  noisy,  and  when  a  priori  information  has 
more  influence.  However,  causal  knowledge  is 
not explicitly captured in strictly Bayesian
systems,  so  that  there  is  no  natural  mechanism 
for  providing  a  causal  explanation  of  the 
resulting  classification.  Some  Bayesian 
systems  that  incorporate  a  form  of  explanation 
are  described  later. 

Another  characteristic  of  the  medical  domain 
is  that  there  is  almost  always  a  shortage  of 
statistical  data  relevant  to  the  patient  at  hand. 
A  small  data  set  also  makes  the  a  priori 
information more influential in the classification
rule.  Again,  situations  like  this  favor  a  model 
that  explicitly  incorporates  a  priori  information. 
A  further  feature  of  a  Bayesian  approach  to 
classification  and  decision  making  is  that 
variations  in  costs  of  misclassifications  are 
easily  incorporated  into  the  decision  rule,  as 
shown  below.  In  medical  applications  it  is 
often  not  appropriate  to  assume  equal  costs  of 
misclassifications,  as  many  other  approaches 
do.  Quite  often  in  medicine  false  negatives 
are more serious than false positives. Trade-offs
between the two types of errors depend
upon  the  specific  application. 
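
The role of unequal misclassification costs can be illustrated with a short sketch. It assumes the posterior probabilities have already been computed and that a loss has been assigned to every combination of action and true state; the function name, the action labels, and all numbers are hypothetical and are not taken from any system discussed here. The rule simply picks the action with the smallest expected loss.

```python
def min_expected_loss_decision(posteriors, loss):
    """Return the action with the smallest expected loss.

    posteriors: dict mapping each true state to P(state | findings)
    loss: dict mapping (action, state) to the cost of taking `action`
          when `state` is the truth
    """
    actions = sorted({action for action, _ in loss})
    expected = {
        a: sum(p * loss[(a, s)] for s, p in posteriors.items())
        for a in actions
    }
    best = min(actions, key=expected.get)
    return best, expected


# Hypothetical case: disease is unlikely, but a missed diagnosis
# (false negative) is taken to be 20 times worse than a false alarm.
posteriors = {"disease": 0.10, "no_disease": 0.90}
loss = {
    ("treat", "disease"): 0.0,
    ("treat", "no_disease"): 1.0,   # false positive
    ("wait", "disease"): 20.0,      # false negative
    ("wait", "no_disease"): 0.0,
}
decision, expected_loss = min_expected_loss_decision(posteriors, loss)
```

With these illustrative costs the rule recommends treatment even though disease is the less probable state, which is the behavior an equal-cost rule cannot reproduce.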


CONCERNS  WITH  A  BAYESIAN  APPROACH 

Although  Bayesian  approaches  to  medical 
decision  making  have  certainly  been  popular 
for  many  applications,  there  are  caveats  that 
need  to  be  addressed  when  designing  such  a 
system.  What  follows  is  a  description  of  the 
major  concerns  with  Bayesian  systems  as  well 
as  recommendations  on  how  to  rectify  the 
problems. 

1.  Assumptions  Required 

The  use  of  Bayes  rule  in  its  complete  form 
for  the  assignment  of  probabilities  to  a  field  of 
many  diseases  requires  an  immense  amount  of 
data.  The  following  formula  shows  that  prior 
and  conditional  probabilities  on  combinations  of 
features  and  diseases  are  necessary  in  order 
to  determine  posterior  probabilities  for  each 
disease. 


$$P(D_i \mid F) = \frac{P(D_i)\,P(F \mid D_i)}{\sum_j P(D_j)\,P(F \mid D_j)}$$

In this formula $D_i$ represents a particular
disease and $F$ represents a particular feature
vector.


The  following  assumptions  are  usually  made, 
not  for  theoretical  reasons,  but  for  simplicity, 
ease  of  calculation,  and  due  to  lack  of  data. 

a.  Conditional  independence  of  features  given 
disease:  Classification  features  are  almost 
always  assumed  to  be  independent  of  one 
another.  This  greatly  simplifies  the  calculation 
of  the  posterior  probabilities  and  significantly 
reduces  the  amount  of  data  collection 
necessary  for  such  a  system.  If  the  individual 
features  are  conditionally  independent  of  one 
another,  one  need  only  have  data  for  each 
feature  value  given  a  disease  instead  of  data 
on  each  combination  of  feature  values.  This 
provides  an  exponential  reduction  of  the 
amount of data required. For a small feature
vector $F = \{f_1, f_2, f_3\}$ the formula now becomes

$$P(D_i \mid f_1, f_2, f_3) = \frac{P(D_i)\,P(f_1 \mid D_i)\,P(f_2 \mid D_i)\,P(f_3 \mid D_i)}{\sum_j P(D_j)\,P(f_1 \mid D_j)\,P(f_2 \mid D_j)\,P(f_3 \mid D_j)}$$


Probably  the  most  common  criticism  of 
Bayesian  algorithms  in  medicine  is  that  the 
features  used  are  not  conditionally 
independent,  even  though  the  assumption  is 
made.  Actually,  good  system  design  involves 
careful  feature  selection.  The  choice  of 
features  and  their  program-specific  definitions 
can  be  optimized  using  an  information  metric, 
such as directed divergence. Independence can
also  be  tested  for  by  observing  the  correlation 
between  features.  For  features  that  are 
correlated,  the  system  developer  has  the 
option  of  creating  a  new  single  feature  that  is 
an  index  based  on  some  weighting  of  the 
correlated  features.  Thus,  there  are  ways  of 
dealing  with  the  problem  of  assuming 
conditional  independence,  and  it  is  also 
important  to  note  that  other  algorithm  models, 
such  as  rule-based  systems,  often  require 
conditional  independence,  even  though  the 
assumption  is  not  made  explicitly. 
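
The posterior calculation under conditional independence can be sketched as follows. This is a minimal illustration, assuming binary (present/absent) features and a single stored number P(feature present | disease) per feature and disease; the disease labels, feature names, and probabilities are hypothetical.

```python
from math import prod


def naive_bayes_posterior(priors, p_present, findings):
    """Posterior over diseases assuming conditionally independent features.

    priors:    {disease: P(D)}
    p_present: {disease: {feature: P(feature present | D)}}
    findings:  {feature: True if present, False if absent}
    """
    joint = {}
    for d, prior in priors.items():
        likelihood = prod(
            p_present[d][f] if present else 1.0 - p_present[d][f]
            for f, present in findings.items()
        )
        joint[d] = prior * likelihood
    total = sum(joint.values())
    return {d: j / total for d, j in joint.items()}


# Hypothetical three-class problem with an explicit "other" class.
priors = {"D1": 0.2, "D2": 0.3, "other": 0.5}
p_present = {
    "D1": {"f1": 0.8, "f2": 0.5},
    "D2": {"f1": 0.4, "f2": 0.5},
    "other": {"f1": 0.1, "f2": 0.2},
}
posterior = naive_bayes_posterior(priors, p_present, {"f1": True, "f2": True})
```

Only one probability per feature and disease is stored, rather than one per combination of feature values, which is the source of the reduction in data requirements.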

b.  Diseases  are  mutually  exclusive:  Although 
it's  possible  to  consider  all  possible 
combinations  of  diseases  as  separate  diseases 
in  order  to  hold  to  the  same  Bayesian  structure 
and  yet  handle  combinations  of  diseases,  this 
approach  requires  a  significant  amount  of  data 
that  is  usually  not  available.  Unless  diseases 
are  correlated  or  have  a  high  prior  probability, 
there  is  not  much  of  a  chance  of  obtaining 
sufficient data on combinations of diseases.
This  in  itself  suggests  a  solution  to  the 
problem  of  combinations  of  diseases.  That  is, 
if the combination is prevalent enough for there
to  be  data  available,  then  perhaps  it  should  be 
considered  as  a  separate  entity  to  be 
diagnosed.  Otherwise,  the  diseases  could  be 
assumed  to  be  independent  of  one  another  and 
a meta-diagnostic strategy could be used to
say which diseases and disease combinations
were actually present. A simple threshold on
probability might be used, or one might choose
a more complicated strategy like the
partitioning algorithm used in the INTERNIST
system from the University of Pittsburgh.


c.  Diseases  are  collectively  exhaustive:  This 
assumption  simply  means  that  the  model 
covers  the  universe  of  possible  events.  All 
possible  diseases  should  be  represented,  as 
well  as  the  event  "no  disease  at  all."  Of 
course,  this  is  not  feasible  in  a  practical 
system.  What  is  usually  done  is  to  work 
within  the  context  of  a  smaller  domain, 
covering  the  diseases  or  events  of  interest, 
and  leaving  a  final  class  of  "other"  for  any 
remaining  diseases  or  events.  The  set  then 
becomes  collectively  exhaustive. 


2.  Subjective  vs  Objective  Probabilities 

Naturally,  objective  probabilities,  frequency 
data  derived  from  observations,  should  be 
incorporated  into  a  Bayesian  system  if 
sufficient  relevant  data  are  available.  There 
are  many  reasons  why  sufficient  relevant 
objective  data  may  not  be  available,  and 
subjective  probabilities  must  be  obtained  from 
experts.  Firstly,  data  on  humans  is  usually 
very  expensive  and  difficult  to  obtain.  There 
may  be  quality  data  on  rats  or  mice,  but 
subjective  probabilities  would  be  required  to 
modify  it  for  inference  about  humans.  Even  a 
study  providing  good  data  on  humans  is  not 
likely  to  perfectly  match  a  given  patient  in 
question.  A  clinician  will  want  to  subjectively 
modify  probabilities  to  account  for  patient- 
specific  factors.  More  generally,  causal 
knowledge  about  disease  processes  needs  to 
be  encoded  subjectively  if  its  effect  has  not 
been  accounted  for  in  observed  data  from  a 
study.  Another  situation  where  subjective 
evaluation  of  probabilities  for  a  Bayesian 
system  is  necessary  comes  when  one  tries  to 
incorporate data from different studies. The
results  may  conflict,  definitions  or  study 
designs  may  be  different,  the  population 
sampled  may  be  quite  different,  etc.  The 
synthesis  of  these  results  is  necessarily  quite 
subjective,  or  at  least  heuristic.  There  have 
been  Bayesian  systems  of  both  types  that 
have  provided  expert  level  performance.  For 
example,  de  Dombal's  system  for  acute 
abdominal  pain  used  observed  frequencies 
from  a  teaching  data  set  for  its  system's 
probabilities  (the  performance  degraded  when 
experts'  subjective  probabilities  were  used), 
and  Gorry's  system  for  management  of  acute 
renal  failure  achieved  expert  performance  using 
subjective probabilities from experts.


Often  clinicians  find  it  easier  to  provide 
quality  information  when  asked  to  estimate 
prior  odds  and  likelihood  ratios.  This  is 
appropriate  for  the  odds  form  of  Bayes  rule. 

$$\frac{P(D \mid F)}{P(\mathrm{not}\ D \mid F)} = \frac{P(D)}{P(\mathrm{not}\ D)} \cdot \frac{P(F \mid D)}{P(F \mid \mathrm{not}\ D)}$$
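
A small sketch of the odds form may help. It assumes the clinician supplies prior odds and one likelihood ratio per finding; multiplying several likelihood ratios together is valid only under the conditional independence assumption discussed earlier. The function name and the numbers are illustrative.

```python
def posterior_probability(prior_odds, likelihood_ratios):
    """Apply the odds form of Bayes rule and convert the result to a probability.

    prior_odds:        P(D) / P(not D)
    likelihood_ratios: iterable of P(F | D) / P(F | not D), one per finding
                       (findings assumed conditionally independent)
    """
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


# Hypothetical example: prior probability 0.1 (odds 1:9) and a single
# finding that is 9 times more likely with the disease than without it.
p = posterior_probability(prior_odds=1.0 / 9.0, likelihood_ratios=[9.0])
```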


3.  Learning  of  Priors  and  Conditionals 

A  priori  probabilities  and  conditional 
probabilities  obtained  for  use  in  Bayesian 
systems  must  be  verified  and  possibly 
modified  for  application  in  different  patient 
populations (new locations, new clinics, etc.).
Also,  the  probabilities  may  change  in  time  with 
changes  in  lifestyles,  environmental  factors, 
treatments,  and  general  disease  characteristics. 
Ideally,  a  system  would  be  able  to  learn  and 
update  these  probabilities  on  its  own.  One 
fairly  common  mistake,  that  needs  to  be 
avoided,  is  using  the  system’s  own 
classification  or  diagnosis  on  each  event  as 
data  for  calculating  new  probabilities.  This 
type  of  learning  is  decision-directed  learning, 
and  the  problem  with  this  is  that  mistakes 
propagate  mistakes.  In  fact,  it  is  possible  to 
equilibrate  with  inaccurate  probabilities  that 
produce  poor  performance.  One  solution  is  to 
have  the  classifications  checked  by  a  human, 
trying  to  avoid  the  bias  of  knowing  the 
machine’s  classification  ahead  of  time.  Another 
solution  that  provides  an  automatic  update  of 
the  probabilities,  is  to  have  a  totally 
independent  classification  algorithm  just  used 
for  updating  purposes.  At  first  glance  this 
seems  like  at  least  double  the  effort  in 
designing  a  system,  but  often  there  are  tests 
(or features) that are very specific but not
sensitive. These would not be that useful for
classifying each event, but very useful for
updating the conditional probabilities. In other
words, $P(F \mid D_i)$ does not have to be updated every
time $D_i$ is diagnosed. The feature, or set of
features, that were very specific but not
sensitive could be used to pick out cases that
were especially likely to be $D_i$ and update
$P(F \mid D_i)$ only on those cases. Forbes et al. at
Hewlett-Packard  Laboratories  did  just  that  in  a 
Bayesian  computer  algorithm  to  classify 
ambulatory  electrocardiogram  waveforms  as 
being  normal  or  abnormal.  For  this  application, 
each  heart  beat  has  to  be  classified  in  real 
time.  They  use  features  of  the  waveform, 
such  as  polarity,  amplitude,  width,  phase,  etc., 
to  classify  each  beat.  The  program  is 
initialized  with  conditional  probabilities  based 
on  physiological  principles  and  general 
observations,  but  since  the  system  allows  for 
arbitrary  lead  placement,  the  shape  of  the 
normals  must  be  learned  very  quickly.  An 
independent  algorithm  comparing  relative 
widths and time intervals between beats is
used to pick out a subset of the beats that are
especially likely to be normal. These are used
to update $P(F \mid \mathrm{normal})$. It is important for this
application  that  the  learning  continues,  since 
the  waveform  shapes  of  the  beats  can  change 
with  patient  movement.  They  chose  to  keep 
the  conditional  probabilities  for  the  abnormal 
beats  at  their  default  values  because  there  are 
many  types  of  possible  abnormal  beats.  Also, 
since cost of misdiagnosis and prior probability
for each class are inversely related and difficult
to  assess,  that  portion  of  the  decision  rule  was 
also  left  static  at  values  that  optimized 
performance.  This  system  has  been  shown  to 
be  very  successful  at  smoothly  and  quickly 
adapting  to  changes  in  the  shape  of  normal 
beats.  The  overall  point  on  automating  the 
learning  of  probabilities  for  Bayesian  systems 
is  that  one  should  think  of  using  independent 
classification  algorithms  that  are  optimized  for 
updating. 
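
The idea of updating a conditional probability only on cases selected by an independent, highly specific test can be sketched as below. This is not the algorithm of Forbes et al.; it is a simple count-based estimate, with a hypothetical class name and a hypothetical weight given to the initial value, shown only to make the gating step concrete.

```python
class GatedConditionalEstimate:
    """Running estimate of P(feature present | class).

    The estimate is updated only when an independent "gate" test says
    the case very probably belongs to the class. The main classifier's
    own decision is never used as training data.
    """

    def __init__(self, initial_probability, prior_weight=10.0):
        # Treat the initial value as if it came from `prior_weight` cases.
        self.present = initial_probability * prior_weight
        self.total = prior_weight

    def update(self, feature_present, gate_says_class):
        if not gate_says_class:
            return  # case not selected by the gate: leave the estimate alone
        self.present += 1.0 if feature_present else 0.0
        self.total += 1.0

    @property
    def probability(self):
        return self.present / self.total


# Hypothetical stream of (feature_present, gate_says_class) observations.
est = GatedConditionalEstimate(initial_probability=0.5)
for feature_present, gated in [(True, True), (True, True), (False, False),
                               (True, False), (True, True)]:
    est.update(feature_present, gated)
```

A system that must track changing conditions would presumably also discount old cases; that refinement is omitted here.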

4.  Explanation 

Perhaps  the  most  important  feature  of  a 
medical  decision  aid  is  its  ability  to  provide 
quality  explanation  of  the  inference  and 
reasoning  used  in  coming  to  a  decision  or 
diagnosis.  This  is  a  major  focus  of  expert 
systems  research,  but  not  a  comfortable 
concept  for  designers  of  statistical  tools.  For 
complicated  medical  decisions,  physicians  are 
not  generally  content  to  be  shown  a  list  of 
regression  weights  as  justification  for  some 
sort  of  action  on  their  part.  Even  proven  good 
performance  does  not  always  inspire 
confidence,  since  each  new  case  is  seen  as 
having  some  unique  factors.  Physicians  are 
most  comfortable  with  explanations  that 
simulate  their  own  reasoning  strategies.  A 
couple  of  researchers  have  addressed  this 
problem,  and  have  provided  explanation 
capabilities  with  their  Bayesian  systems.  In  the 
gastroenterology  decision-support  system  of 
Spiegelhalter  and  Knill-Jones  there  is  a  display 
that  provides  a  complete  summarization  of  the 
decision  rule  in  an  easily  understandable 
format.  Evidence  for  a  disease  is  listed  on 
one  side  and  evidence  against  the  disease  is 
listed  on  the  other.  Each  feature  has  an 
accompanying  log-likelihood  ratio  that  serves 
as its score. If the score is positive, then the
feature  and  score  go  under  "evidence  for"  the 
disease.  Otherwise,  it  is  evidence  against  the 
disease. Since the logs of the likelihood ratios
are shown, the scores add, and it is easy to
see  the  relative  importance  of  each  finding. 
The  a  priori  probability  of  disease  is  also 
shown  in  log  form  and  is  added  in  to  obtain  a 
final  score.  This  is  converted  into  the 
resulting  probability  of  disease.  "Evidence  for" 
and  "evidence  against"  is  a  natural 
representation  of  information  for  physicians. 
The  scoring  may  appear  ad  hoc  to  the  casual 
user  of  the  system,  but  to  those  that 
understand Bayesian analysis, a complete
representation of the reasoning is presented in
a  useful  way.  Sometimes  system  users  may 
be  put  off  by  the  amount  of  detail  provided  by 
such  a  scoring  scheme  and  may  prefer  to  have 
it  represented  graphically  in  a  histogram. 
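
A display of this kind can be sketched in a few lines. The sketch assumes natural-log likelihood ratios as scores and the log of the prior odds as the starting score; the actual scaling used by Spiegelhalter and Knill-Jones is not given here, and the finding names and numbers are hypothetical.

```python
from math import exp, log


def explain(prior_probability, likelihood_ratios):
    """Split findings into evidence for and against, and total the scores.

    likelihood_ratios: {finding name: P(finding | D) / P(finding | not D)}
    Returns (evidence_for, evidence_against, total_score, probability).
    """
    prior_score = log(prior_probability / (1.0 - prior_probability))
    scores = {name: log(lr) for name, lr in likelihood_ratios.items()}
    evidence_for = {n: s for n, s in scores.items() if s > 0}
    evidence_against = {n: s for n, s in scores.items() if s <= 0}
    total = prior_score + sum(scores.values())
    probability = 1.0 / (1.0 + exp(-total))  # convert log-odds to probability
    return evidence_for, evidence_against, total, probability


# Hypothetical findings for a disease with prior probability 0.2.
ev_for, ev_against, total, prob = explain(
    0.2, {"finding_a": 8.0, "finding_b": 0.5}
)
```

Because the scores are additive, the relative size of each entry in the two lists shows directly how much each finding moved the final result.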

Reggia  and  Perricone  came  up  with  another 
form  of  explanation  for  their  Bayesian  system 
to classify strokes. After acquiring values for
the features, probabilities for the various types
of strokes are shown. The user has the option
to  justify  any  of  them.  Justification  includes  an 
optional  explanation  of  Bayes  formula,  and  a 
list  of  the  features,  their  values,  and  their 
scores.  Also  provided  is  a  list  of  features  with 
unknown  values  that  might  alter  the  results  if 
their  values  were  known. 

In  general,  it  seems  possible  to  provide  a 
good  explanation  and  useful  summarization  of 
the  analysis  in  a  Bayesian  system.  However, 
the explanations are by necessity associational,
and  we  need  to  keep  in  mind  that  humans 
usually  reason  causally  for  explanation,  trying 
to  find  a  mechanism  or  process  that  explains 
the  observations. 

SUMMARY 

The  advantage  of  using  a  Bayesian  approach 
to  medical  decision  making  is  that  a  priori 
information  and  costs  of  misdiagnoses  can  be 
represented  explicitly  and  easily.  The 
concerns  usually  posed  regarding  Bayesian 
systems  do  not  seem  insurmountable.  Three 
assumptions  are  usually  made  for  practical 
reasons:  conditional  independence  of  features, 
mutual  exclusivity  of  diseases,  and  the 
assumption that the set of diseases is
collectively  exhaustive.  It  was  shown  that  with 
careful  design,  the  problems  with  these 
assumptions  could  be  avoided.  It  was  also 
shown  that  subjective  probabilities  could  be 
used  in  a  Bayesian  system  giving  expert  level 
performance.  Finally,  automated  learning  of 
probabilities  and  explanation  features  are  new 
areas  for  Bayesian  systems,  and  the  systems 
reviewed  suggest  mechanisms  for  significantly 
improving  the  performance  and  acceptability  of 
Bayesian  systems  for  medical  applications. 


REFERENCES 

Atkinson,  P.,  Training  for  Certainty,  Soc.  Sci. 
Med.,  Vol.19,  No.  9,  pp.  949-956,  1984. 

Clancey, W.J., Shortliffe, E.H., Readings in
Medical Artificial Intelligence: The First
Decade. Menlo Park, CA: Addison-Wesley.

Forbes, A.D., et al., A Dual Channel Bayesian
Algorithm  for  Ambulatory  Electrocardiogram 
Analysis,  Proceedings  of  Computers  in 
Cardiology,  1986. 

Pople,  H.:  The  formation  of  composite 
hypotheses  in  diagnostic  problem-solving:  An 
exercise in synthetic reasoning. In
Proceedings of the Fifth International Joint
Conference on Artificial Intelligence, pp.
1030-1037.  Pittsburgh,  PA:  Carnegie-Mellon 
University,  Department  of  Computer  Science. 
1977. 

Reggia,  J.A.,  Perricone,  B.T.,  Answer 
Justification  in  Medical  Decision  Support 
Systems  Based  on  Bayesian  Classification, 
Comput.  Biol.  Med.,  Vol.  15,  No.  4,  pp. 
161-167,  1985. 

Reiss, Eric: In Quest of Certainty. Am J of
Med; 1984; 77:969-971.

Spiegelhalter,  D.J.,  Knill-Jones,  R.P.: 
Statistical  and  Knowledge-based  Approaches 
to  Clinical  Decision-support  Systems,  With  an 
Application  in  Gastroenterology.  J  R  Statist 
Soc  A;  1984;  147:35-77. 




CONTROLLING  GRAPHICS 

Organizer:  Paul  Velleman,  Cornell  University 

MacSpin:  Graphical  Data  Analysis 

Andrew W. Donoho, University of Texas, Austin; David L. Donoho,
University  of  California,  Berkeley;  Miriam  Gasko,  University  of  Chicago 

GRAP - A Language for Statistical Displays

Jon L. Bentley, Brian W. Kernighan, AT&T Bell Laboratories




MACSPIN: GRAPHICAL DATA ANALYSIS

Andrew  W.  Donoho,  University  of  Texas,  Austin 
David  L.  Donoho,  University  of  California,  Berkeley 
Miriam  Gasko,  University  of  Chicago,  G.B.S. 


Over  the  last  decade,  computer  graphics  researchers 
have  developed  systems  for  the  dynamic  display  of  data. 
These  interesting  and  stimulating  displays  showed  the 
potential  usefulness  of  interactive  graphics  in  the  analysis  of 
multivariate  data.  Videotapes  of  such  displays  showed 
instantaneous  graphical  responses  to  researchers'  queries. 
However, these systems -- like PRIM-9 (SLAC) and PRIM-H
(Harvard)  and  the  Orion  (Stanford)  --  were  one-of-a-kind 
installations  accessible  to  only  a  very  few  people.  Because  of 
their  expense,  there  was  not  much  prospect  of  widespread  use 
of  such  systems;  as  long  as  their  primary  purpose  was  seen  as 
the  probing  of  theoretical  horizons  and  the  production  of 
videotapes  to  take  to  conferences,  such  important  issues  as 
ease-of-use  and  capability  for  working  with  real  data  on  an 
everyday  basis  could  be  ignored. 

A  useful  graphics  system  must  have  the  following 
properties. First, it has to offer a graphical "toolbox" adequate
for  interactive  exploratory  data  analysis.  Second,  its  user 
interface  has  to  be  simple  to  operate,  simple  enough  not  to 
distract  the  user  from  his  main  task,  analyzing  the  data  at 
hand.  Last,  but  not  least,  it  must  have  input  facilities  that 
accommodate the researcher's existing files and output facilities
that  help  him  document  his  findings.  We  describe  a  dynamic 
graphics  system,  MacSpin,  which  meets  these  requirements 
and  runs  on  an  inexpensive  desktop  computer  -  the  Macintosh. 

MacSpin  has  advanced  graphics  capabilities.  It  goes 
beyond  two-dimensional  x-y  plots,  and  lets  you  view  and 
interact  with  data  in  3  dimensions  -  and  more.  You  can  view 
x-y-z  plots,  rotating  them  in  real  time  to  get  a  true 
three-dimensional  perception  of  the  structure  of  your  data.  By 
means  of  animation,  you  can  make  movies  of  your  data  showing 
how  the  three-dimensional  cloud  varies  as  a  function  of  a 
fourth  variable.  MacSpin  is  a  useful  tool  for  identifying  trends, 
patterns, clusters and outliers in high-dimensional data. The
capability for augmenting the display with text information
allows one to identify datapoints (e.g. outliers) and correlate
qualitative information with the patterns you observe in the
display.

MacSpin  offers  easy  modes  for  inputting  data  and  for 
porting  data  from  other  programs  and  mainframes.  It  also  has 
useful  documentation  capabilities,  that  allow  one  to  produce 
hard  copy  of  screen  views,  or  to  insert  such  images  into  other 
computer  documents. 

And  all  these  operations  can  be  carried  out  by  MacSpin 
with  the  user  simply  pointing  and  clicking  with  the  mouse. 
Here  we  illustrate  the  main  features  of  the  program  by  telling 
the  story  of  its  application  to  two  datasets. 

The  Cars  Story 

The  first  dataset  we  consider  consists  of  all  the  cars 
road-tested  by  Consumer  Reports  magazine  between  1971  and 
1983.  The  data  will  help  us  see  how  the  auto  industry  changed 
over the last decade or so. The names of the 418 cars are listed
in  the  events  window  on  the  MacSpin  display  (lower  right): 
the  portion  we  see  includes  the  Plymouth  Barracuda  and 
Plymouth  Fury  III,  cars  from  the  early  1970's.  The  variables 
window  (partially  obscured  by  the  events  window)  shows  the 
variables we have measured for each car: things involving
performance  (Gallons  per  Mile,  Seconds  to  reach  60  MPH  from 
a  full  stop),  size  (horsepower,  weight,  ...),  and  miscellaneous 
(model  year,  continent  of  origin). 

X-Y  Plots.  The  view  in  the  plot  window  shows  all  the 
cars  (American,  European,  and  Japanese)  in  an  x-y  plot,  with 
x=Gal/Mi  (i.e.  fuel  usage  per  mile)  versus  y=slowness  (Secs. 
0-60).  The  points  represent  individual  cars.  By  moving  the 
cursor  to  a  point  and  clicking,  we  can  find  out  its  identity.  The 
point  at  the  upper  left  (slow  but  economical)  is  a  VW  pickup; 
the  point  in  the  lower  right  (fast  gas-guzzler)  is  a  Plymouth 
Barracuda. By holding down the control and option keys as we
identify, the full data record pops up. This shows us that the
fast, efficient car in the lower left is a 1981 Datsun 280ZX.


Figure 1: Gasoline consumption and acceleration time for 418
cars reviewed by Consumer Reports between 1973 and
1981.


Figure 2: Information pop-up for the Datsun 280ZX.


X-Y-Z  Plots.  The  x-y  plot  shows  the  general  trend  of  the 
auto  industry  -  what  combinations  of  speed  and  economy  are 
available.  By  rotating  the  plot,  we  can  get  an  extra  dimension 
into  the  display.  We  point  at  a  rotation  icon  on  the  far  right, 
and  hold  down  the  mouse  button.  The  2-d  plot  becomes  a 
rotating  3-d  plot,  the  previously  hidden  z-variable  coming 
into  play.  When  we  do  this  for  the  Cars  data,  and  bring 
z=weight  into  the  display,  we  see  a  cloud  of  points  rotating 
smoothly  in  space.  The  cloud  is  shaped  like  a  sausage  and 
shows  the  combinations  of  economy,  speed,  and  weight  being 
built  during  the  1971-1983  period.  As  we  rotate  this,  we  notice 
a  few  interesting  things.  First,  one  point  turns  out  to  be  an 
outlier.  We  stop  the  rotation  and  identify  it;  it  is  an 
International Harvester truck. Somehow a truck has slipped
into a database on Cars! When we scroll to the truck's name in
the events window, we see that Consumer Reports road-tested
a few other trucks, too. By pointing at their names on the list,
we can highlight them in the plot window. They are also
outliers. By choosing "Exclude" from the events window, we
can (temporarily) remove them from the display. The rotation
has helped us identify and remove outliers.


Figure 3: Highlighting and excluding trucks from the display.


Highlighting Subsets. Further rotation shows that the
data consist of three clusters. Seeking an explanation, we
bring the Subsets window to the front. This shows some subsets
of the data predefined (by us) as being interesting to look at.
By pointing at the name of any subset, we can highlight its
members on the display. When we do this, we see that the 3
clusters consist of 8, 6, and 4 cylinder cars, respectively. We
could also highlight American, European, and Japanese subsets
in turn, and find out where they are on the display.


Figure 4: Highlighting subsets of 4, 6, and 8 cylinder cars.


Animation permits us to study the effect of a fourth
variable on a display. Suppose we are interested in how the
American auto industry has changed over time. We can
highlight the American cars, and then select Focus from the
events menu. MacSpin now temporarily excludes imported cars
from the display. We then drag the Year variable to the
scroll bar in the lower left. This will let us scroll through the
data model-year by model-year. We begin at 1971. The cars
made then are concentrated in the lower left of the display:
fast, heavy, gas-guzzling cars. As we scroll smoothly forward,
we see that the data drift systematically towards the upper
left - toward slower, lighter, more economical machines.


Figure 5: Animation showing changes in the performance
of American cars over time: the years 1971, 1978, and 1983 are
shown.

Transformations.  The  researcher  can  also  transform 
existing  variables  to  create  new  ones.  Features  like  this  make 
MacSpin  useful  not  just  for  displaying  data  but  also  for 
manipulating  it  to  get  the  right  display.  We  just  saw  that  cars 
became  more  economical  over  the  period  1973-1981.  Did  they 
just  become  lighter  and  smaller,  or  was  there  an  actual  increase 
in  mechanical  efficiency?  Dividing  Gal/Mi  by  Weight  gives  us 
a  standardized  measure  of  fuel  efficiency  in  which  the  effects 
of  weight  are  taken  out.  Looking  at  plots  with  this  new 
variable  would  give  us  insight  into  whether  American  cars  got 
more  efficient  or  whether  they  just  got  smaller  over  this 
period.  Variable  transformations  are  all  included  in  a  special 
transformations  window,  and  executed  by  pointing  and  clicking 
with  the  mouse. 

Figure 6: Transformations menu window.
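
The weight-adjusted measure described under Transformations is an elementwise ratio of two existing variables. A minimal sketch outside MacSpin, with a hypothetical function name and made-up values, is:

```python
def weight_adjusted_fuel_use(gal_per_mile, weight):
    """Divide fuel use (gallons per mile) by vehicle weight, car by car."""
    return [g / w for g, w in zip(gal_per_mile, weight)]


# Two hypothetical cars: the lighter one uses half the fuel per mile,
# but per pound of vehicle the two are equally efficient.
gal_per_mile = [0.08, 0.04]
weight = [4000.0, 2000.0]
adjusted = weight_adjusted_fuel_use(gal_per_mile, weight)
```

In this made-up case the whole difference in fuel use is explained by weight, which is the kind of distinction the transformed variable is meant to expose.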


Markers.  MacSpin  also  makes  it  easy  to  get  hard  copy 
of  data  displays.  (Since  you  can  mark  subsets  with  special 
symbols,  you  can  use  these  to  convey  some  of  what  the  dynamic 
exploration  showed  you.)  "Screen  dumps"  are  generated  using 
the  Command-Shift-3  sequence.  The  figure  below  is  derived 
from  a  screen  dump.  1971  model  cars  are  marked  with  a  box, 
and  1983  model  cars  with  an  asterisk.  The  resulting  image  was 
cropped,  and  shadows  and  captions  were  drawn  in,  using 
MacPaint. 



Figure 7: American cars, with special markers given to
model years 1971 and 1983.


The Diabetes Story

These data were provided by Reaven and Miller of
Stanford University. The original graphic analysis of it was
done on the PRIM-9, and reported in Diabetologia in 1979.
The data describe 145 nonobese and nonketotic patients who
agreed to participate in a medical experiment. The purpose
was to assess the relationships among various measures of
plasma glucose and insulin in order to illuminate the etiology
of "Chemical" and "Overt" diabetes. Each patient underwent
a glucose tolerance test, and the following quantities were
measured: Age, Relative Weight, Fasting Plasma Glucose, Test
Plasma Glucose (a measure of insulin intolerance), Steady
State Plasma Glucose, and Plasma Insulin during the test. In
addition we have the doctors' classification of these patients
as overt diabetic, chemical diabetic, or normal. Age and
Relative Weight turned out to be unimportant and hence are
excluded from our analysis (as they were by Reaven and
Miller).

The opening view of our demonstration has the
following variables assigned to the three axes: Fast Glucose on
the X-axis, Test Glucose on the Y-axis and Test Insulin on the
Z-axis (Figure 8).


Figure 8: Glucose measurements of 145 patients who
underwent a glucose tolerance test.


This view, showing the data distributed in a
sausage-shaped cloud, supports the interpretation that there
is but one direction in which abnormality develops, as we
progress from normal patients to Chemical to Overt diabetics.
However, as soon as we start to rotate the data around the
X-axis, and tilt it a bit to better show the third dimension, Z
(Figure 9), we can see the point cloud has, in fact, the shape of
a boomerang. We can no longer accept that there is just one
direction of disease development.


Figure 9: Rotation showing the boomerang aspect of the data.


The most natural question at this point is: What makes
the two arms of the boomerang? As the doctors have classified
the diabetic patients as either Chemical or Overt, we can
highlight each subset separately. As Figure 10 shows, each
arm corresponds to one of the groups. We can also mark the
points corresponding to Overt diabetics with x's, and points of
Chemical diabetics with diamonds (Figure 11).


Figure 10: Chemical and overt diabetics shown
occupying the two arms of the boomerang.


Figure 11: The glucose measurements with the
"Chemical" and "Overt" subsets marked.


Whatever our notation, our conclusion is that Chemical
and Overt diabetes are two different syndromes, not just one
manifested at different levels of intensity.

The examples show MacSpin's usefulness in exploratory data
analysis: how its dynamic graphics can reveal data structures
and the answers to focused questions about data. Among the
important features we illustrated were:

• rotation to show a third dimension
• identifying interesting points
• highlighting important subsets
• animation to look at a fourth variable
• transforming the data
• marking subsets

Our live demonstrations of these examples at the Interface
conference testified to the system's ease of use. While
MacSpin is not a replacement for standard statistical
procedures (and, hence, has been designed to facilitate the
porting of files between programs and mainframes), it is a
valuable addition to the data analyst's "kit of tools".
ABSTRACT

This  paper  describes  Unix®  tools  for  preparing  publication-quality  graphical  displays.  A 
general-purpose graphing language, GRAP, provides for automatic scaling and tick marks, input data
transformation  and  processing,  multiple  independent  coordinate  systems,  and  multiple  graphs  in  a 
single  display.  Specialized  languages  (implemented  as  GRAP  preprocessors)  deal  with  specialized 
graphs,  such  as  dotcharts,  box  plots  and  scatter-plot  matrices. 

Although  originally  designed  for  document  preparation,  GRAP  has  been  used  for  such  diverse 
tasks  as  exploratory  data  analysis  and  prototyping  new  graphical  displays. 


1.  Introduction 

The  Unix*  operating  system  includes  a  family  of 
tools  for  document  preparation.  The  basic  tool  is  a 
venerable  text  formatter  called  TROFF.  That  formatter 
does  not  deal  directly  with  complicated  material  like 
mathematics  and  tables.  Instead,  specialized  kinds  of 
typesetting  are  handled  by  preprocessors  that  translate 
specialized languages into TROFF commands. For example,
a language called EQN translates expressions like

X bar = 1 over n sum from i=1 to n f sub i x sub i

into $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} f_i x_i$. Other languages include TBL for
specifying tables, and PIC for drawing simple line
diagrams. The Unix document-preparation tools are
described in [1]; a survey of the field can be found in
[2].

One  area  not  served  by  the  suite  of  programs 
mentioned  above  is  the  graphical  display  of  data.  In 
most  document  preparation  systems,  the  only  way  to 
include  a  graph  is  by  (mechanical  or  electronic)  cutting 
and  pasting  of  a  separately  prepared  figure.  The  GRAP 
language  [3, 4]  was  designed  to  make  it  easy  to  describe 
graphs  and  to  include  them  in  documents  prepared 
with  TROFF  and  related  programs.  This  paper  was 
typeset  by  those  tools,  without  benefit  of  scissors  or 
paste,  physical  or  electronic. 

This  paper  will  describe  the  elements  of  the  GRAP 
language, and illustrate its use as a vehicle for experimenting
with new forms of statistical display and
packaging them for convenient use.

2.  The  GRAP  Language 

In  its  simplest  use,  GRAP  converts  a  set  of  x,y 
pairs  into  a  scatter  plot,  generates  ticks  automatically, 
and  puts  the  result  in  a  standard  frame.  Given  pairs 
showing  remaining  life  expectancy  as  a  function  of  age, 
GRAP  produces  this  (simple)  plot: 


[Plot: remaining life expectancy (years) versus age]


A  graph  is  often  part  of  a  larger  document.  The  parts 
of  the  document  intended  for  GRAP  are  delimited  by  the 
commands  .G1  and  .G2;  other  text  is  copied  through 
untouched.  The  input  for  the  graph  above  is  just  the 
data  itself  (the  ellipsis  marks  omitted  data  items): 
.G1
5 69.8
10 64.9
...
85 6.5
.G2

GRAP  translates  graph  specifications  into  PIC  com¬ 
mands.  To  format  a  document  containing  graphs,  one 
would  normally  use  a  pipeline  of  commands  such  as 

grap filenames | pic | troff

The  default  display  may  be  refined  by  specifying 
more  parameters.  Labels  may  be  added  on  any  side, 
ticks  may  be  defined  by  an  explicit  list  or  an  iterator, 
data  may  be  copied  from  a  separate  file,  and  the  points 
may  be  connected  by  lines  of  various  styles: 

label bottom "Present Age"
label left "Remaining" "Years" left .1
ticks left from 0 to 70 by 10
frame ht 1.3 wid 2 top invis right invis
draw solid
copy "life.d"

The  file  life.d  contains  the  age-expectancy  data 
shown above. The clause left .1 moves the text from
its  default  position  by  that  many  inches.  The  .G1  and 
.G2  delimiters  are  not  shown  in  this  and  subsequent 
inputs. 

The  core  of  GRAP  includes  commands  for  plotting 
arbitrary  text  at  any  point,  drawing  arbitrary  lines  and 
arrows, setting the range and optional logarithmic scaling
of coordinate axes explicitly, and drawing grid lines.

GRAP  does  not  provide  a  large  variety  of  built-in 
graph  types.  Rather,  it  offers  primitive  operations  out 
of  which  many  different  graphs  can  be  built.  One  of 
the most important of these primitive operations is a
simple  macro  processor.  The  statement 

define name { replacement text }

defines a macro. Subsequent occurrences of name will
be replaced by the replacement text. Instances of $1, $2,
etc., in the replacement text will be replaced by the
corresponding arguments in a macro call like
name(arg1, arg2, ...).


To  illustrate,  consider  plotting  expected  age  at 
death rather than remaining years, for which the y coordinate
is the sum of age and expectancy:


[Plot: expected age at death versus present age]


frame ht 1.3 wid 2
label bottom "Present Age"
label left "Expected Age" "at Death" left .1
define show { bullet at $1, $1+$2 }
copy "life.d" through show

In a copy statement, each line of the source file is converted
into a call of the specified macro, with each field
becoming  the  corresponding  argument.  In  fact,  it  is  not 
necessary  to  define  the  macro  separately: 

copy "life.d" through { bullet at $1, $1+$2 }

is equivalent, and notationally more convenient.
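
The macro mechanism can be mimicked outside GRAP. The following Python sketch shows how `copy ... through` turns each data line into one macro call by substituting fields for `$1`, `$2`, and so on. It is not GRAP's implementation (which is written in C); the function names are hypothetical, and the choice to replace a missing argument with an empty string is an assumption.

```python
import re

def expand(body, args):
    """Replace $1, $2, ... in body with the corresponding argument."""
    def substitute(match):
        index = int(match.group(1)) - 1
        # Assumption: a missing argument expands to the empty string.
        return args[index] if index < len(args) else ""
    return re.sub(r"\$(\d+)", substitute, body)

def copy_through(lines, body):
    """Turn each whitespace-separated data line into one expanded macro call."""
    return [expand(body, line.split()) for line in lines if line.strip()]

data = ["5 69.8", "10 64.9", "85 6.5"]
commands = copy_through(data, "bullet at $1, $1+$2")
```

Each element of `commands` is the text GRAP would go on to evaluate; the arithmetic in `$1+$2` is left for the plotting language itself.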

As this example suggests, GRAP provides the ability
to do arithmetic, both on input data and on variables.
It also has an if-else statement and a for
loop.

It  is  possible  to  show  multiple  curves  on  a  single 
plot;  each  set  of  values  is  independently  scaled  and 
plotted.  For  example,  this  graph  plots  a  second  set  of 
data  that  shows  the  fraction  of  an  original  100  people 
still  alive  at  the  given  age: 

[Plot: remaining years (solid) and percent surviving (dotted) versus age]

frame wid 2.5
label bot "Age"
ticks bot from survivors 0 to 80 by 10
ticks left from expectancy 0 to 70 by 10
ticks right from survivors 0 to 100 by 25
draw expectancy solid
draw survivors dotted
copy "life3.d" through {
    next expectancy at expectancy $1, $2
    next survivors at survivors $1, $3
}
"Percent Surviving" at survivors 65, 100
"Remaining" "Years" at expectancy 5, 50


Data or parameters may be plotted in a particular coordinate
system by placing the name of that system before
the scalar value or x,y pair.

One of the most useful features of GRAP is the
ability to combine several subgraphs into one overall
graph. As a simple example, the life expectancy and
survivor data above may be plotted as two separate
graphs with a common x axis:

graph Exp
frame ht 1.25 wid 2
ticks left from 0 to 70 by 10
tick bottom off
label left "Remaining" "Years" left .1
draw solid
copy "life3.d" through { $1, $2 }
graph Frac with .Frame.north at Exp.Frame.south
frame ht 1.25 wid 2
ticks left from 0 to 100 by 25
label left "Percent" "Surviving" left .1
draw solid
copy "life3.d" through { $1, $3 }
label bottom "Age"

The graph statement defines a subgraph with its own
coordinate systems, data, etc. Subgraphs may be positioned
arbitrarily with respect to previous subgraphs
using PIC positioning commands such as with.

The examples above show that GRAP gives the
user a great deal of freedom in preparing x,y plots in
standard forms. It has also proven to be a useful tool
for experimenting with display formats. The file
cars.d contains the mileage (miles per gallon) and the
weight (pounds) for 74 models of automobiles sold in
the United States in the 1979 model year. A simple
scatter plot shows that as mileage increases, weight
decreases nonlinearly. A more interesting graph shows
that inverse mileage (gallons per mile) is proportional to
weight.

[Plot: weight (pounds) versus gallons per mile, with miles per gallon marked along the top]

The top ticks denote the extremes, quartiles, and
median. The graph was generated by

frame ht 2.2 wid 2.2
coord x 0, 0.1 y 0, 5000
label left "Weight" "(pounds)" left .2
label bot "Gallons per Mile"
ticks bot from 0 to 0.10 by 0.02
label top "Miles per Gallon"
ticks top at 1/12 "12", 1/18 "18", \
    1/20 "20", 1/25 "25", 1/41 "41"
copy "cars.d" thru { circle at 1/$1, $2 }

In [5], Tufte proposes the "dot-dash-plot" as a
means for maximizing data ink (showing the two-dimensional
distribution and the two one-dimensional
marginal distributions) while minimizing what he calls
"chart junk" (ink wasted on borders and non-data
labels). His graph is easy to execute in GRAP:
frame invis ht 2 wid 2
coord x 0, 0.1 y 0, 5000
copy "cars.d" thru {
    tx = 1/$1; ty = $2
    bullet at tx,ty
    tick bot at tx ""
    tick left at ty ""
}

which  produces: 
Although visually attractive, we do not find the resulting
graph as useful for interpreting the data as the first
representation. Tufte's graph, however, does point out
two facts not obvious in the previous format: there is a
gap in car weights near 3000 pounds (exhibited by the
hole in the y-axis ticks), and the gallons per mile axis is
regularly structured (the ticks are the reciprocals of an
almost dense sequence of integers).

A word on implementation: GRAP is implemented
as a preprocessor for PIC so as to take advantage of
PIC's features for plotting and positioning text. GRAP
itself handles collection of data, maintains the independent
input coordinate systems, and scales the outputs in
each.

The language is specified with a YACC grammar
and the processor is written in C; it is about 3000 lines
of code altogether. GRAP went from initial conception
to use by people other than the authors in about a week
and to books and published papers within several
months. The total software development time is
perhaps three or four person-months.

3. A Language for Dotcharts

Macros provide one way to encapsulate a complicated
or lengthy sequence of GRAP commands. More
interesting, however, is the notion of a "little
language." If a particular class of graph is used frequently,
one can design a small language for describing
instances of the class, and implement a "compiler" that
translates from that specialized language into GRAP
statements. In that case, GRAP serves as an assembly
language.

The simplest example is a language for describing
"dotcharts" or "lolliplots" [6]. The following dotchart
shows total Northern and Southern casualties (killed
and wounded) in the major battles of the Civil War.

[Dotchart: Casualties of the Civil War]

This  straightforward  GRAP  program  produces  that 
dotchart: 

label "Casualties of the Civil War"
coord x 0 to 45000
ticks left off
nr = 0
copy "civwar.d" through {
    nr = nr + 1
    yval = -nr
    line dotted from 0,yval to $1,yval
    bullet at $1,yval
    $2 rjust at -0.02,yval
    lastx = $1
}
"" at 0,0; "" at 0,-(nr+1)
frame ht (nr+1)*0.125 wid 2.5

The  data  items  are  counted  as  they  are  printed;  the 
frame  height  is  computed  and  the  frame  drawn  after  the 
data  has  been  plotted. 
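
The bookkeeping in that program (count the rows while plotting, then size the frame) can be restated as a short Python sketch. The function name and the tuple layout are hypothetical; only the arithmetic mirrors the GRAP listing above.

```python
def dotchart_layout(rows, spread=0.125):
    """rows: list of (value, label). Returns row placements and frame height in inches."""
    placed = []
    nr = 0
    for value, label in rows:
        nr += 1
        yval = -nr  # each item is drawn one unit below the previous one
        placed.append((label, value, yval))
    frame_ht = (nr + 1) * spread  # known only after every row has been counted
    return placed, frame_ht

rows = [(40000, "Gettysburg"), (36000, "Seven Days"), (30000, "Petersburg Siege")]
placed, frame_ht = dotchart_layout(rows)
```

Because the height depends on the final row count, the frame has to be emitted last, which is why the GRAP program draws it after the copy loop.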

If  one  is  preparing  only  a  few  dotcharts,  each  can 
be  built  with  a  text  editor.  If  there  are  a  large  number 
of similar graphs to be prepared, however, it is probably
worth automating the job. We therefore designed
a  DOTCHART  language  in  which  one  can  specify  a  large 
class  of  dotcharts.  The  DOTCHART  "compiler"  reads  a 
dotchart  specification  and  generates  GRAP  commands 
(only  slightly  more  complex  than  those  above)  to  print 
the  desired  figure.  For  example,  the  dotchart  above 
was  specified  as 
file "civwar.d"
label "Casualties of the Civil War"
spread .125
coord x 0 to 45000
width 2.5
quoted

The file command specifies that input comes from a
file civwar.d:

40000 "Gettysburg"
36000 "Seven Days"
30000 "Petersburg Siege"
28500 "Chickamauga"
28000 "The Wilderness"
2270 "Kennesaw Mtn"

Other  commands  set  parameters  of  the  dotchart  as 
needed,  and  any  remaining  lines  (such  as  label  and 
coord)  are  assumed  to  be  GRAP  commands  that  make 
sense  in  context;  they  are  copied  through  verbatim. 

The  implementation  of  DOTCHART  is  noteworthy 
mainly  for  its  small  size  —  the  first  version,  adequate 
for  dotcharts  like  the  one  above,  is  less  than  25  lines 
long (see [3]). It is written in AWK, a general-purpose
string  processing  language  [7]. 
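
To make the "compiler" idea concrete, here is a minimal Python sketch of such a translator. It is not the authors' AWK program (that appears in [3]); the function name, the default parameter values, and the treatment of `quoted` as a no-op flag are assumptions, and the emitted GRAP simply follows the hand-written dotchart program shown earlier.

```python
def compile_dotchart(spec_lines):
    """Translate a DOTCHART-style specification into a list of GRAP command lines."""
    params = {"file": None, "spread": "0.125", "width": "2.5"}
    passthrough = []
    for raw in spec_lines:
        line = raw.strip()
        if not line:
            continue
        words = line.split(None, 1)
        key = words[0]
        if key in params and len(words) == 2:
            params[key] = words[1]      # a DOTCHART parameter
        elif key == "quoted":
            continue                    # assumption: labels in the file are already quoted
        else:
            passthrough.append(line)    # assumed to be a GRAP command; copied verbatim
    out = list(passthrough)
    out += [
        "ticks left off",
        "nr = 0",
        "copy %s through {" % params["file"],
        "    nr = nr + 1",
        "    yval = -nr",
        "    line dotted from 0,yval to $1,yval",
        "    bullet at $1,yval",
        "    $2 rjust at -0.02,yval",
        "}",
        '"" at 0,0; "" at 0,-(nr+1)',
        "frame ht (nr+1)*%s wid %s" % (params["spread"], params["width"]),
    ]
    return out

spec = [
    'file "civwar.d"',
    'label "Casualties of the Civil War"',
    "spread .125",
    "coord x 0 to 45000",
    "width 2.5",
    "quoted",
]
grap = compile_dotchart(spec)
```

The design point is the same one the text makes: lines the translator does not recognize are passed through unchanged, so the little language inherits all of GRAP without reimplementing it.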

With  the  basic  design  of  DOTCHART  in  hand,  it  is 
easy  to  add  features  that  express  variations  on  the 
theme.  For  example,  Cleveland  advocates  dotted  lines 
that  go  all  the  way  across  the  chart  when  the  baseline  is 
at zero. Four lines of AWK code in the DOTCHART compiler
and another parameter in the language implement
the  new  style  guide  across: 

[Dotchart: maximum speed in km/hr (log scale, ticks at 10, 20, 40, 80, 160) for 23 animals, listed from Cheetah to Chicken]


file "animal.d"
label "Maximum speed, km/hr"
coord x 10 to 120 log x
ticks bot at 10, 20, 40, 80, 160
spread .125
width 2.5
guide across

4.  Other  Specialized  Languages 

DOTCHART  is  not  the  only  little  language  that 
prepares  input  for  GRAP,  although  it  was  perhaps  the 
easiest  to  implement.  Another  language,  SCATMAT,  is 
used  for  describing  scatter-plot  matrices  [8]. 

Given  a  set  of  n  observations  of  k  attributes,  a 
scatter-plot matrix is a k×k array of scatter plots. For
example, this file contains data on the nine planets: distance
from the sun, temperature, mass and radius:


0.4    600   .05   .4
0.75   370   .8    1
1      330   1     1
1.5    300   .11   .5
5      140   318   11
40     50    1     .5

(Temperature is degrees Kelvin; other attributes are normalized
to earth's value.) This scatter-plot matrix
shows  the  six  pairwise  relationships  among  the  four 
variables: 
Although it would be possible to specify a
scatter-plot matrix "by hand" using GRAP's facility for
defining  subgraphs,  it  would  require  an  inordinate 
amount  of  work,  much  more  than  for  a  dotchart.  Thus 
we  designed  another  language,  again  to  be  processed 
into  GRAP  by  a  small  compiler,  also  written  in  AWK. 




The  input  language  for  SCATMAT  is  similar  to  the 
DOTCHART  language: 

file "planets.d"
frames ht .75 wid .75
spread 0
alllog
name Distance
field $1
name Temperature
field $2
name Mass
field $3
name Radius
field $4

The  first  version  of  SCATMAT  was  about  35  lines 
of AWK; the current version is about 100 lines. It provides
many more parameters, and can also deal with
variations  like  using  the  other  diagonal  and  printing 
only  one  triangle  of  the  matrix. 
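
The panel bookkeeping behind a scatter-plot matrix is simple to sketch. The Python fragment below enumerates the panels of a k×k matrix and the "one triangle" variation mentioned above. It is not SCATMAT itself (which is written in AWK); the function name and its options are hypothetical, and skipping the diagonal is an assumption about how variable-against-itself cells are treated.

```python
def scatmat_cells(names, skip_diagonal=True, lower_triangle_only=False):
    """List (row, col, y_name, x_name) for each panel of a k-by-k scatter-plot matrix."""
    cells = []
    for row, y_name in enumerate(names):
        for col, x_name in enumerate(names):
            if skip_diagonal and row == col:
                continue
            if lower_triangle_only and col > row:
                continue
            cells.append((row, col, y_name, x_name))
    return cells

names = ["Distance", "Temperature", "Mass", "Radius"]
full = scatmat_cells(names)
triangle = scatmat_cells(names, lower_triangle_only=True)
```

With four variables, the triangle holds the six pairwise relationships noted in the text; the full matrix shows each pair twice, once with the axes exchanged.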

It  is  useful  to  build  languages  for  other  particular 
graphs  as  well;  we  have  done  so  for  box  plots,  and  have 
seen  one  for  pie  charts.  Such  languages  are  easy  to 
build and can be easy to use because the common output
language and (usually) common implementation
encourage  a  similar  style. 

5.  Conclusions 

Our  original  goal  was  a  language  for  preparing 
publication-quality  graphs.  That  goal  has  been 
achieved: with the addition of GRAP to the Unix document
preparation tools, we are now able to produce
complicated  graphical  displays  with  little  effort.  The 
quality  is  acceptable  for  books.  Examples  may  be  found 
in Cleveland [6], Aho, Sethi and Ullman [9] and Bentley
[10].

GRAP  has  also  proven  useful  for  exploratory  data 
analysis,  even  though  that  was  not  our  intent.  This  is 
certainly  not  because  it  runs  fast  (for  most  graphs  it  is 
much  slower  than,  for  example,  the  S  system  [11]),  but 
apparently  because  its  textually  based  interface  fits  well 
with  other  Unix  tools.  It  is  easy  to  prepare  data  with 
some  program,  massage  it  into  the  right  format  (either 
with a general tool like AWK or with the input processing
of GRAP itself), then plot it to see what things look
like. 

GRAP  has  also  turned  out  to  be  surprisingly  useful 
for prototyping statistical displays. It has built-in facilities
for both display and computation, and provides an
easy  escape  to  the  Unix  environment  when  the  built-in 
mechanisms  are  not  adequate. 

As  a  more  general  observation,  many  tasks  can  be 
profitably  approached  by  designing  and  implementing  a 
"little  language"  specialized  to  that  task.  Users  can 
thereby  express  their  solutions  in  terms  closely  related 
to  their  view  of  the  problem.  Specialized  languages  for 
graphs,  dotcharts,  and  scatter-plot  matrices  are  merely 
examples  from  one  domain.  Indeed,  the  entire  family 
of Unix document preparation programs consists of
such little languages, some feeding TROFF directly,
while  others  compile  into  intermediate  languages. 

In  most  of  these  languages,  it  appears  necessary 
to  provide  some  degree  of  programmability;  otherwise, 
users  are  restricted  to  those  things  that  the  implementor 
thought  of.  For  GRAP  especially,  the  ability  to  program 
the processor to define a new style has proven invaluable.
1. An Empirical Bayes Model for Image Data.

Motivated by the LANDSAT problem of estimating
the probability of crop or geological types based on
multi-channel satellite imagery data, Morris and Kostal
(1983), Hill, Hinkley, Kostal, and Morris (1984), and
Morris, Hinkley, and Johnston (1985), henceforth labeled
MK83, HHKM84 and MHJ85, developed an empirical
Bayes approach to this problem. We return here
to those developments, making certain improvements
and extensions, but restricting attention to the binary
case of only two attributes.

Label the pixels in a rectangular lattice as $i = (j, k)$,
$j = 1, 2, \ldots, J$ and $k = 1, 2, \ldots, K$. Each of these
$n = JK$ pixels has attribute $\theta_i$, taking values $\theta_i = 0$
or $\theta_i = 1$ to indicate only two possible distinct types. A
one-time vector $D_i$ of measurements is available for each
pixel, usually involving several bandwidths for several
time points. In this simple version a one-dimensional
function $y_i$ of $D_i$ is all that will be considered (as noted
in MHJ85, if $D_i$ is multidimensional, then $y_i$ is the best
one-dimensional summary if it is chosen as the logarithm
of the likelihood ratio of $D_i$ for $\theta_i = 1$ and $\theta_i = 0$).

An empirical Bayes model is defined as one that provides
two families of distributions, one for the data, conditional
on the parameters, and one for the parameters.
The descriptive empirical Bayes model specifies distributions

(1) $P(y \mid \theta)$ for the data $\{y_i\}$, conditional on the unknown
parameters $\{\theta_i\}$, i.e., the likelihood functions, and

(2) a parametric family $p_\alpha(\theta)$ of distributions for the
parameters, indexed by hyperparameters $\alpha \in A$.

The inferential empirical Bayes model is mathematically
equivalent to the descriptive model, but respecifies the
distributions as

(1*) $P_\alpha(y)$, the marginal distribution for the data, now
dependent on $\alpha \in A$, and

(2*) $p_\alpha(\theta \mid y)$, the conditional distribution for the
parameters $\{\theta_i\}$ given the data $\{y_i\}$ and $\alpha$.

The distributional choices here are the same as for MHJ85.
The key simplifying assumption is that the correlations
between observed measurements enter entirely
through the parameters $\{\theta_i\}$, the observed data $\{y_i\}$
being conditionally independent, given $\{\theta_i\}$.

DESCRIPTIVE MODEL:

(1) Given $\theta_i$, $i = 1, \ldots, n$, $\theta_i = 0$ or 1, assume

$y_i \mid \theta_i \sim N(\delta(\theta_i - .5),\, 1)$ independently, with $\delta$ a
known constant.

(2) The $\{\theta_i\}$, in the wide sense, have a stationary,
isotropic distribution on the lattice with

$\pi = P(\theta_i = 1)$, all $i = (j, k)$,

and for $-s \le t, u \le s$,

$\rho_{t,u} = \mathrm{Corr}(\theta_{j,k},\, \theta_{j+t,k+u})$, all $j, k$.

Some  comments  on  this  model  are  required. 

The distribution (1), which says $y_i$ has mean $\pm\delta/2$
and unit variance, is equivalent via location and scale
changes to any model for $y_i$ giving $y_i$ a normal distribution
with means $\mu_0$ or $\mu_1$ when $\theta_i = 0$ or $\theta_i = 1$ and
variances $\sigma_0^2 = \sigma_1^2 = \sigma^2$. Then $\delta = (\mu_1 - \mu_0)/\sigma$. The
parameters $(\mu_0, \mu_1, \sigma)$ and the form (normal) of the density
of $y_i$ are assumed known. In practice, these would
be known based on vast experience with "training data"
(where the values of $\theta_i$ could be observed along with $y_i$).

The stationary assumption (2) for the parameters justifies
letting the parameters $\pi$ and $\rho_{t,u}$ be independent
of $i = (j, k)$. This assumption needs to hold only in
the wide sense because the inferential methods used do
not involve more than these first and second moments.
The isotropic assumption would serve further to simplify
$\rho_{t,u}$, and we do identify $\rho_{t,u} = \rho_{u,t} = \rho_{|t|,|u|}$, but
do not take full advantage of the rotational invariance.
The hyperparameters $\alpha$ then are $r + 1 = (s + 1)(s + 2)/2$
dimensional. For example, if $s = 2$, then $r = 5$ and

$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_6) = (\pi, \rho_{10}, \rho_{11}, \rho_{20}, \rho_{21}, \rho_{22})$.
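
The count of hyperparameters can be checked by enumeration. Under the identification $\rho_{t,u} = \rho_{u,t} = \rho_{|t|,|u|}$, one representative lag per class is a pair with $0 \le u \le t \le s$, excluding the zero lag. The Python sketch below lists those pairs; the function name is hypothetical and the choice of representative (larger index first) is an assumption that happens to match the ordering in the example above.

```python
def correlation_lags(s):
    """Distinct (t, u) lags with 0 <= u <= t <= s, excluding the zero lag (0, 0)."""
    return [(t, u) for t in range(1, s + 1) for u in range(0, t + 1)]

lags = correlation_lags(2)
r = len(lags)
```

For $s = 2$ this gives the five lags (1,0), (1,1), (2,0), (2,1), (2,2), so together with $\pi$ there are $r + 1 = 6$ hyperparameters, in agreement with $(s+1)(s+2)/2$.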

It is important to realize that the hyperparameters must
be estimated from the observed data $\{y_i\}$ in the target
site. Training data taken from other settings can be
used to determine the conditional distribution of $\{y_i\}$
for known $\{\theta_i\}$, i.e., the distribution (1), but different
hyperparameters, say $\tilde\alpha$, would prevail at the training
site, so training data could be used to estimate $\alpha$ only
if the unexpected assumption $\tilde\alpha = \alpha$ held.

2. Results for the Inferential Model: The Discriminant
Function Approximation and Identity.

Development of the inferential model proceeds as in
MHJ85, but with a more useful representation of the
discriminant function. The model (1),(2) leads to a
very complicated exact form for each marginal posterior
probability, given $\alpha$, $P(\theta_i = 1 \mid y, \alpha)$. However, a
good approximation to this probability, with accuracy
improving as $\delta \to 0$, is of logistic form. We go further
to approximate the logistic function by the discriminant
function, which effectively predicts $P(\theta_i = 1 \mid \text{data})$
from the "ring" averages, these being averages of those
data values in specified locations (rings) relative to pixel
$i$, as in Figure 1.
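
As a small illustration of a ring average, the following Python sketch computes the ring-0 and ring-1 values for one interior pixel of a toy lattice. The function name and the data are hypothetical, and the sketch assumes an interior pixel, since the treatment of lattice edges is not specified in the text.

```python
def ring1_average(y, j, k):
    """Average of the four nearest neighbours of pixel (j, k) in lattice y.

    Assumes (j, k) is an interior pixel; edge handling is not specified in the text.
    """
    return (y[j][k + 1] + y[j][k - 1] + y[j - 1][k] + y[j + 1][k]) / 4.0

y = [
    [0.0, 1.0, 0.0],
    [2.0, 9.0, 4.0],
    [0.0, 3.0, 0.0],
]
x0 = y[1][1]                 # ring 0: the pixel itself
x1 = ring1_average(y, 1, 1)  # ring 1: the four nearest neighbours
```

Higher rings would be handled the same way, averaging the data values at the lattice offsets that make up each ring.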


Figure 1: Ring locations, $R_0, R_1, \ldots, R_r$, centered
with pixel $i$ at ring $R_0$. The four nearest points,
marked 1, are $R_1$, the next four $R_2$, and so on. Pairs
$(t, u)$ indicate similar correlation structure, with common
values $\rho_{t,u}$ if $(t, u) = (\pm t, \pm u) = (\pm u, \pm t)$.

Define the $n \times (r + 1)$ matrix $X$ of "regressors" to
have the $i$th row and $t$th column element as the
average of measurements at ring $t$, $R_t$, $t = 0, 1, \ldots, r$.
Thus, with $i = (j, k)$, $x_{i0} = y_i$ is the regressor for ring
0,

(2.1) $x_{i1} \equiv (y_{j,k+1} + y_{j,k-1} + y_{j-1,k} + y_{j+1,k})/4$,

is the regressor for $R_1$, and so on, as in Fig. 1. Then
modify $X$ so that all column totals add to zero, by subtracting
the column average from each column.

If all the parameters $\{\theta_i\}$ were known, we would calculate
the discriminant function as follows. Let $C(\theta) =
X'\theta/n$, $\bar\theta = \sum_i \theta_i/n$. Denote

(2.2) $R_\theta^2 = n\, C'(\theta)(X'X)^{-1} C(\theta) / \bar\theta(1 - \bar\theta)$

as the multiple correlation coefficient between $\theta$ and the
columns of $X$, and

(2.3) $\mathrm{RSS}(\theta) = \bar\theta(1 - \bar\theta)(1 - R_\theta^2)$

as the residual sum of squares.

The discriminant analysis formula has the form

(2.4) $\lambda_i = \log\left[\dfrac{P(\theta_i = 1 \mid y)}{P(\theta_i = 0 \mid y)}\right] = \log(\bar\theta/(1 - \bar\theta)) + \dfrac{x_i'(X'X)^{-1} C(\theta)}{\mathrm{RSS}(\theta)} - (\bar\theta - .5)\dfrac{R_\theta^2}{1 - R_\theta^2}$.

This form is equivalent to that of MHJ85, but is in more
useful form because the quantities involved are directly
related to standard linear regression output.

The sample autocovariances $c_t$, $t = 1, 2, \ldots, r$, being

$c_t = \ldots$

and the mean $\bar y$ provide estimates of the hyperparameters.
Let $\hat\pi = .5 + \bar y/\delta$. Then

(2.5) $E_\theta\, \bar y = \delta(\bar\theta - .5)$, $E_\theta\, \hat\pi = \bar\theta$ and

(2.6) $E\,(\delta\hat\pi(1 - \hat\pi),\; c_1/\delta, \ldots, c_r/\delta)' \doteq E\,C(\theta)$.

Thus $\bar\theta$ is estimated by $\hat\pi$ and $C(\theta)$ by

$\hat C = (\delta\hat\pi(1 - \hat\pi),\; c_1/\delta, \ldots, c_r/\delta)'$.

Therefore (2.4) may be estimated, even when $\theta$ is
unknown.
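
The estimate $\hat\pi = .5 + \bar y/\delta$ and the first component of $\hat C$ are easy to compute directly. The Python sketch below uses a noise-free toy sample so the result can be checked by hand; the function names and data are hypothetical, and the autocovariance components $c_t/\delta$ are omitted because their defining formula is not reproduced here.

```python
def pi_hat(y, delta):
    """Estimate the fraction of pixels with theta = 1 from the sample mean of y."""
    y_bar = sum(y) / len(y)
    return 0.5 + y_bar / delta

def c_hat_first(y, delta):
    """First component of the estimate of C(theta): delta * pi_hat * (1 - pi_hat)."""
    p = pi_hat(y, delta)
    return delta * p * (1.0 - p)

delta = 2.0
# Noise-free illustration: three pixels at +delta/2 (theta = 1), one at -delta/2 (theta = 0).
y = [1.0, 1.0, 1.0, -1.0]
estimate = pi_hat(y, delta)
c0 = c_hat_first(y, delta)
```

Here three of the four pixels have $\theta_i = 1$, and the estimate recovers that fraction exactly because no noise was added; with real data $\hat\pi$ is unbiased for $\bar\theta$ but not exact.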

We  now  can  specify  the 
INFERENTIAL  MODEL: 

(1') The marginal distribution of the data $y$ satisfies
(2.5)-(2.6); and,

(2') the posterior probability $P(\theta_i = 1 \mid y, \alpha)$ follows
approximately the form (2.4).

3. Empirical Bayes Estimation: Estimating Discriminant
Function Parameters from Remotely
Sensed Data.

Empirical Bayes modeling stops short of specifying
a unique method for approximating the function (2'),
but (2.5) and (2.6) provide obvious approaches. The
simplest estimate of $\bar\theta$ is $\hat\pi$ defined by (2.5), since $\hat\pi$ is
unbiased in the empirical Bayes sense, $E(\hat\pi - \bar\theta) = 0$ for
all joint distributions considered in (1)-(2), that is, for
all $\alpha$. The approximately unbiased estimate $\hat C$ in (2.6)
of $C(\theta)$ was used in MHJ85 to replace $C(\theta)$ in (2.4),
and in the $R_\theta^2$ and $\mathrm{RSS}(\theta)$ formulae (2.2), (2.3). This
works well when $\hat C$ has small variance. The estimator
of MHJ85 results:


(3.1) $\hat\lambda_i = \log(\hat\pi/(1 - \hat\pi)) + \dfrac{x_i'(X'X)^{-1}\hat C}{\mathrm{RSS}_0} - (\hat\pi - .5)\dfrac{R^2}{1 - R^2}$

with

(3.2) $\mathrm{RSS}_0 \equiv \hat\pi(1 - \hat\pi)(1 - R^2)$, $R^2 = n\,\hat C'(X'X)^{-1}\hat C / \hat\pi(1 - \hat\pi)$.

The estimator (3.1) was shown to fare well in a variety of
settings, compared with the "ideal" estimator $\lambda_i$ (which
is not available in practical problems).

Still, improvements to (3.1) may be necessary in certain
cases: because $\hat\lambda_i$ depends on the data being used
to estimate $C(\theta)$, so that one is not guaranteed that

(3.3) $E\, x_i'(X'X)^{-1}\hat C = x_i'(X'X)^{-1} C(\theta)$,

for instance; and because $\lambda_i$ is non-linearly dependent
on $C(\theta)$, so that variability of any nearly unbiased estimator
$\hat C$ will cause bias in non-linear estimation of
$C(\theta)$.


Recognizing that $\hat p_i = \exp(\hat\lambda_i)/(1 + \exp(\hat\lambda_i))$ estimates
the posterior probability $p_i = E(\theta_i \mid \text{data})$, since such
calculations were used to justify the logistic form of $p_i$,
the data-dependency objection can be handled by replacing
$\theta_i$ by $\hat p_i$ in $C(\theta)$, still using $\hat\pi$ for $\bar\theta$ (experience
shows that $\hat\pi$ is very close to $\sum \hat p_i/n$). This suggests an
iterative rule:

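(a) Calculate $\hat p_i$, $1 \le i \le n$, using (3.1);

(b) Replace $\bar\theta$ with $\hat\pi$ and, in $C(\theta)$, $\theta$ with $(\hat p_1, \ldots, \hat p_n)'$ in
(2.4) to get $\hat\lambda_i$;

(c) Compute $\hat p_i = \{1 + \exp(-\hat\lambda_i)\}^{-1}$; and,

(d) Return to (b), using $(\hat p_1, \ldots, \hat p_n)$.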
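
The loop structure of steps (a)-(d) can be sketched as follows. This is only a minimal illustration: formula (2.4) is not reproduced here, so `logit_rule` is a hypothetical stand-in supplied by the caller, and `toy_rule` is an invented example with no connection to the spatial model.

```python
import math

def logistic(lam):
    """Step (c): p = 1 / (1 + exp(-lambda))."""
    return 1.0 / (1.0 + math.exp(-lam))

def iterate_posteriors(p_init, logit_rule, n_iter=5):
    """Repeat steps (b)-(d): plug the current p into a rule for lambda,
    then map lambda back to a probability.

    logit_rule(p) is a hypothetical stand-in for formula (2.4); it must
    return one log-odds value per pixel given the current probabilities.
    """
    p = list(p_init)
    for _ in range(n_iter):
        lam = logit_rule(p)                  # step (b)
        p = [logistic(l) for l in lam]       # step (c)
    return p                                 # step (d) is the loop itself

def toy_rule(p):
    """Invented rule: log-odds depend on the pixel and on the mean of p."""
    mean_p = sum(p) / len(p)
    return [2.0 * (pi - 0.5) + (mean_p - 0.5) for pi in p]

p_final = iterate_posteriors([0.9, 0.8, 0.2, 0.6], toy_rule, n_iter=3)
```

Fixing the number of iterations in advance, rather than iterating to convergence, reflects the remark in the text that the coefficients grow too large after several iterations.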

Initial experience with this rule has resulted in more
stable relative values of the regression coefficients $\beta =
(X'X)^{-1}X'\theta$, but their absolute values are too large
after several iterations. This could be due to overestimation
of the convex function $1/RSS(\theta)$.

The problem of quadratic dependence of $R_\theta^2$ on $\theta$ in
(2.2) can be easily handled by methods conditional on
the data, since, with expectation conditional on $X$, and
$p \equiv E(\theta \mid X)$,

(3.4)  $E\,\theta'(X'X)^{-1}\theta = \mathrm{tr}\,((X'X)^{-1} E\,\theta\theta') = p'(X'X)^{-1}p + \mathrm{tr}\,((X'X)^{-1}\Sigma_X)$.

To implement this, however, the currently unavailable
posterior probabilities of pairwise occurrences of $\theta_i$ and
$\theta_j$ also are needed to compute $\Sigma_X$, the conditional
covariance matrix of the $\theta$ vector.
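
The identity behind (3.4), E θ'Aθ = p'Ap + tr(AΣ), can be checked by brute force on a small example. The sketch below assumes, purely for illustration, that the θ_i are independent Bernoulli variables, so that Σ is diagonal with entries p_i(1 - p_i); the text makes no such assumption, and the matrix `A` is an arbitrary symmetric stand-in for the matrix appearing in (3.4).

```python
from itertools import product

def quad_form(v, A):
    """Return v' A v for a vector v and a square matrix A (lists of lists)."""
    n = len(v)
    return sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))

def expected_quad_form_exact(p, A):
    """E[theta' A theta] by enumerating every 0/1 vector.

    Assumes independent Bernoulli(p_i) components (illustrative only).
    """
    total = 0.0
    for theta in product([0, 1], repeat=len(p)):
        prob = 1.0
        for t, pi in zip(theta, p):
            prob *= pi if t == 1 else 1.0 - pi
        total += prob * quad_form(theta, A)
    return total

def expected_quad_form_formula(p, A):
    """p' A p + tr(A Sigma), with Sigma = diag(p_i * (1 - p_i))."""
    trace_term = sum(A[i][i] * p[i] * (1.0 - p[i]) for i in range(len(p)))
    return quad_form(p, A) + trace_term

A = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]]
p = [0.2, 0.7, 0.5]
exact = expected_quad_form_exact(p, A)
formula = expected_quad_form_formula(p, A)
```

With dependent θ_i the off-diagonal entries of Σ would also enter the trace term, which is why the pairwise posterior probabilities mentioned above are needed.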

4.  Other  Uses  of  the  Spatial  Logistic  Estimator: 
Detection  of  Edges,  Corners,  and  Shapes. 

The technology of Section 2 also can be used for the
purposes of determining the probabilities that edges of
shapes, or even part of a particular shape, exist at a
location. For edge and corner detection, it is convenient
to shift the entire rectangular lattice up and sideways
one-half pixel, so that points for this form of detection
are relocated at the original pixel boundaries and corners
rather than at pixel centers. Then the matrix $X$ of
(2.4) is specified not as in (2.1) and the surrounding
discussion, but with other codings sensitive to boundaries.
For example, a “signed horizontal edge detector”, as in
Fig. 2, when placed at location $i$, adds the 12 y-values
above and subtracts the 12 y-values below, the sum
producing the value $x_i$ at location $i = (j, k)$. Note that $x_i$
has an expected value of $12\delta$ at locations for which all
12 pixels above are of type 1 and all 12 pixels below
are type 0, still assuming $y_i \mid \theta_i \sim N(\delta(\theta_i - .5), 1)$. Of
course, $-12\delta$ is obtained if all 24 pixels are reversed, and
values between $-12\delta$ and $12\delta$ result in more scattered
situations. $E x_i = 0$ in the middle of a large homogeneous
shape. Large values of $|x_i|$ indicate the presence
of a horizontal edge, but without suggesting whether
$\theta_i = 1$ to the north or the south.


Figure  2:  Assignments  of  numerical  weights  for  a 
horizontal  4x6  edge  detector  placed  at  the  center  of 
24  pixels. 
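
The expected detector values quoted in the text (12δ across a clean edge, 0 inside a homogeneous region) can be reproduced with a small sketch. It assumes the 4x6 detector means 4 rows by 6 columns, with weight +1 on the upper two rows and -1 on the lower two; the function names and the 8x6 test image are hypothetical.

```python
def horizontal_edge_kernel(rows=4, cols=6):
    """Weights +1 on the top half and -1 on the bottom half.

    The 4-row by 6-column layout is an assumption about Fig. 2.
    """
    half = rows // 2
    return [[1 if r < half else -1 for _ in range(cols)] for r in range(rows)]

def detector_response(values, top, left, kernel):
    """Weighted sum of values under the kernel whose top-left cell is (top, left)."""
    return sum(
        w * values[top + r][left + c]
        for r, row in enumerate(kernel)
        for c, w in enumerate(row)
    )

delta = 2.0
# Noise-free means delta * (theta - 0.5): rows 0-3 are type 1, rows 4-7 are type 0.
theta = [[1] * 6 for _ in range(4)] + [[0] * 6 for _ in range(4)]
means = [[delta * (t - 0.5) for t in row] for row in theta]
kernel = horizontal_edge_kernel()

at_edge = detector_response(means, 2, 0, kernel)    # kernel straddles the boundary
interior = detector_response(means, 0, 0, kernel)   # kernel entirely inside type 1
```

Working with the means rather than simulated noisy values keeps the example deterministic; adding unit-variance Gaussian noise would scatter the responses around these expected values.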
Similarly, vertical edge detectors would correspond
to  turning  Fig.2  on  its  side,  and  two  possible  corner 
detectors  might  be  as  in  Fig.3  (these  types  would  work 
best  in  a  checkerboard  setting).  The  first  detector  would 
be  particularly  sensitive  to  the  meeting  of  two  corners, 
the  second  one  to  the  southwestern  corner  of  a  figure 
extending  into  a  homogeneous  area. 

Figure  3:  Corner  detectors  for  a  rectangular  grid. 


Variables $x'$ and $x''$ might be coded as indicated by
Fig. 3 for each position $i = (j, k)$, with the data values
$\{y_i\}$ assigned weights according to the values in the
detectors (and zero outside the detector). Thus $x_i'$ would
be the sum of values of the eight nearest pixels to the
northeast and southwest minus the sum of the other
eight nearest values.

Detectors  for  other,  more  general  shapes,  could  be 
set  up  in  an  analogous  fashion,  assigning  positive  values 
to  data  in  locations  where  the  shape  would  exist,  and 
negative  values  elsewhere. 

The true values $\theta_i$ again must be defined as binary
values, if the methods of Section 2 are to apply. For
example, we might have $\theta_i = 1$ if a horizontal or vertical
boundary exists at location $i$, otherwise $\theta_i = 0$.
Then the $X$ matrix might have two variables, row $i$ being
$(1, |x_i|, |\tilde x_i|, \ldots)$, $x_i$ from the edge detector of Fig. 2
and $\tilde x_i$ from a signed vertical edge detector. Formula
(2.4) again is available, providing an estimate of $P(\theta_i =
1 \mid \text{data})$ via the discriminant function. However,
estimates of $\bar\theta$ and $C(\theta)$ are required, and will take a
different form than given in Section 3. If such values
are available, these methods can be used in conjunction
with estimates of the probabilities of each classification
as given in Section 3 to give more accurate estimates at
the borders of those regions having significant sizes.
5. Conclusions.

Empirical Bayes methods, as reported in MK83,
HHKM84 and MHR85, provide a helpful perspective
from which one can view the remote sensing, image
restoration, and other spatial problems. Lessons include:

(i) Spatial correlation can be modeled as occurring entirely
in the ground truth process $\{\theta_i\}$, or jointly
in ground truth and in the observations $\{y_i\}$. The
conditional covariances of $(y_i, y_j)$ given $\theta$ would be
derived from training data.

(ii) Training data are inadequate for estimating the
hyperparameters $\alpha$. Their proper use is to determine
the likelihood function (e.g., “Badhwar numbers”,
“greenness”, and “brightness” formulas derived for
use in LANDSAT satellite data applications).

(iii) Remotely sensed data may be used to estimate the
proper Bayes rules and thus are used for the dual
purposes of hyperparameter $\{\alpha\}$ estimation and
ground truth $\{\theta_i\}$ estimation.

The celebrated work of D. Geman and S. Geman
(1984) falls within the empirical Bayes paradigm in that
the hyperparameters they specify are estimated from
the marginal distribution of the data, rather than being
arbitrarily chosen. While their results are more general
than those here, we hope eventually to extend further
the approach in this paper to include the polytomous
case (several possible values) for $\theta_i$, non-normal
multivariate distributions for $y_i$ (the likelihood ratio then
plays a central role), and dependencies in the distribution
of the observations, given the ground truth. This
paper indicates some extensions of MHJ85 by suggesting
improvements in estimation of hyperparameters, and in
expanding the role of the technique to include detection
of edges and shapes. Continuing with this approach not
only will provide further insights, but will provide
computationally quicker methods than the computer-intensive
techniques necessary to estimate posterior modes.
This will allow more data and larger regions to be analyzed.
Partial and Interaction Spline Models for the Semiparametric Estimation of
Functions  of  Several  Variables. 


Grace  Wahba,  University  of  Wisconsin-Madison 


A partial spline model is a model for a response as a function
of several variables, which is the sum of a "smooth"
function of several variables and a parametric function of
the same plus possibly some other variables. Partial spline
models in one and several variables, with direct and
indirect data, with Gaussian errors and as an extension of
GLIM to partially penalized GLIM models are described.
Application to the modelling of change of regime in several
variables is described. Interaction splines are introduced
and described and their potential use for modelling nonlinear
interactions between variables by semiparametric
methods is noted. Reference is made to recent work in
efficient computational methods.

1.  Introduction 

Partial spline models have proved to be interesting
both from a practical and a theoretical point of view, partly
because of their dual nature both as solutions to certain
intuitively reasonable variational problems, and as Bayes
estimates with certain parsimonious priors. In these
proceedings we will attempt to give a quick rundown
concerning some of their more interesting manifestations, and
to report briefly on two new developments: first, the use of
partial spline models to describe discontinuities or changes
of regime, in two, three and higher dimensions, and,
second, the idea of interaction splines for use in studying
nonlinear interactions between variables semiparametrically.

2.  Partial  spline  models  -  one  splined  variable 

A response as a function of the variables $x, z_1, \ldots, z_k$
is modelled as

$y_i = f(x(i)) + \sum_{j=1}^{p} \theta_j \Psi_j(x(i); z(i)) + \epsilon_i$  (2.1a)

where

$z(i) = (z_1(i), \ldots, z_k(i))$,  (2.1b)

the $\Psi_j$'s are given parametric functions and the $\epsilon_i$'s are
independent, zero mean Gaussian random variables with
common (unknown) variance. The estimate $(f_\lambda, \theta_\lambda)$, where
$\theta_\lambda = (\theta_{1\lambda}, \ldots, \theta_{p\lambda})'$, is found as the minimizer, in an
appropriate space, of

$\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x(i)) - \sum_{j=1}^{p} \theta_j \Psi_j(x(i); z(i))\bigr)^2 + \lambda J_m(f)$  (2.2a)

where

$J_m(f) = \int_0^1 (f^{(m)}(x))^2\,dx$.  (2.2b)

We  have  the  following 


Theorem: (Kimeldorf and Wahba (1971) - KW) Let
$\phi_1, \ldots, \phi_m$ span the null space of $J_m$. If the design matrix
for least squares regression on span
$\{\phi_1, \ldots, \phi_m, \Psi_1, \ldots, \Psi_p\}$ is of full column rank, then there
exists a unique minimizer $(f_\lambda, \theta_\lambda)$ for any $\lambda > 0$, and $f_\lambda$ is a
polynomial spline function.

The parameter $\lambda$ as well as $m$ can be chosen by generalized
cross validation (GCV).
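
For reference, the GCV criterion in its usual form (see Craven and Wahba (1979)) is sketched below. The symbol $A(\lambda)$ is not defined in this paper; it is assumed here to denote the $n \times n$ influence matrix that maps the data vector $y$ to the fitted values for a given $\lambda$. The GCV estimate of $\lambda$ is the minimizer of $V(\lambda)$.

```latex
V(\lambda) \;=\;
\frac{\frac{1}{n}\,\bigl\lVert \bigl(I - A(\lambda)\bigr)\,y \bigr\rVert^{2}}
     {\Bigl[\frac{1}{n}\,\operatorname{tr}\bigl(I - A(\lambda)\bigr)\Bigr]^{2}}
```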

The appropriate function space here is the Sobolev
space $W_2^m$; however, $J_m$ (and $W_2^m$) can be replaced by any
seminorm in a reproducing kernel (r. k.) Hilbert space of
real valued functions on [0,1] provided that least squares
regression onto the span of the null space of the seminorm
is well defined. One then gets a Bayes estimate with the r. k.
related to the prior covariance. Details may be found in
KW and Wahba (1978) but we will not discuss the Bayesian
aspect any further, other than to note that the prior
behind $J_m$ is the most parsimonious member of a large
class of equivalent priors.

Partial spline models with one splined variable were
introduced by several authors in different contexts, with
some interesting applications; see Anderson and Senthilselvan
(1982), Engle et al. (1983), Green, Jennison, and
Seheult (1983), Shiller (1984).

3.  Partial  Spline  Models  -  Several  Splined  Variables 

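Now, let the model be

$y_i = f(x(i)) + \sum_{j=1}^{p} \theta_j \Psi_j(x(i); z(i)) + \epsilon_i$  (3.1a)

where

$x = (x_1, \ldots, x_d)$,  $x(i) = (x_1(i), \ldots, x_d(i))$.  (3.1b)

Again, we find $f$ in an appropriate space to minimize

$\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x(i)) - \sum_{j=1}^{p} \theta_j \Psi_j(x(i); z(i))\bigr)^2 + \lambda J_m(f)$  (3.2)

where now, we can use the "thin plate spline" penalty
functional. For $d = 2$, $m = 2$, it is

$J_m(f) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \bigl(f_{x_1x_1}^2 + 2 f_{x_1x_2}^2 + f_{x_2x_2}^2\bigr)\,dx_1\,dx_2$  (3.3)

and for arbitrary $d$ it is

$J_m(f) = \sum_{\alpha_1 + \cdots + \alpha_d = m} \frac{m!}{\alpha_1! \cdots \alpha_d!} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Bigl(\frac{\partial^m f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}\Bigr)^2 dx_1 \cdots dx_d$  (3.4)

provided $2m > d$. The null space of $J_m$ is the span of the
$M = \binom{d+m-1}{d}$ monomials of total degree less than $m$, call
them $\phi_1, \ldots, \phi_M$. Again, there will be a unique minimizer
$(f_\lambda, \theta_\lambda)$ for every nonnegative $\lambda$ if the design matrix
for least squares regression on $\phi_1, \ldots, \phi_M, \Psi_1, \ldots, \Psi_p$ is
of full column rank, and $f_\lambda$ is a thin plate spline function.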
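
As a quick check on the dimension of the null space, the monomials of total degree less than m in d variables can be enumerated directly and counted against the binomial formula. The function names below are hypothetical and serve only this illustration.

```python
from itertools import product
from math import comb

def null_space_monomials(d, m):
    """Exponent tuples (a_1, ..., a_d) with a_1 + ... + a_d < m.

    Each tuple stands for the monomial x_1**a_1 * ... * x_d**a_d.
    """
    return [a for a in product(range(m), repeat=d) if sum(a) < m]

def null_space_dimension(d, m):
    """M = C(d + m - 1, d), the number of such monomials."""
    return comb(d + m - 1, d)

monos = null_space_monomials(2, 2)   # the constant, x_2 and x_1
```

For d = 2 and m = 2 this gives the three functions 1, x_1 and x_2, so the thin plate penalty (3.3) leaves planes unpenalized.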

Partial splines with several splined variables were
introduced in Wahba (1984a), Wahba (1984b), Wahba
(1985), and a discrete version has been proposed by Green,
Jennison, and Seheult (1986). Transportable code
(GCVPACK, Bates et al. (November 1985)) is available for
fitting the partial spline models of (3.1)-(3.4) and computing
the GCV estimate $\hat\lambda$ of $\lambda$. This code does well with up
to around 400 data points on the VAX 11/750 in the Statistics
Department at Madison. The work primarily depends
on $n$, and not $d$, but, of course, good estimates with large $d$
will require large $n$. Diagnostics for splines (without the
"partial" part) have been developed by Eubank (1986); it
can be anticipated that this work will extend to partial
spline models.

4.  Indirect  measurements 


Let

$g(x; z) = f(x) + \sum_{j=1}^{p} \theta_j \Psi_j(x; z)$,  (4.1)

and now let

$y_i = L_i g + \epsilon_i$  (4.2)

where $L_i$ is a bounded linear functional, for example:

$L_i g = \int w_i(x; z)\, g(x; z)\, \Pi\,dx\, \Pi\,dz$.  (4.3)

This kind of data comes up in X-ray tomography, satellite
tomography, stereology, and in other remote or indirect
sensing problems in the physical and biological sciences.
One finds $f$ and $\theta$ to minimize:

$\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - L_i f - \sum_{j=1}^{p} \theta_j L_i \Psi_j\bigr)^2 + \lambda J_m(f)$.  (4.4)


The  use  of  variants  of  (4.3),  and  (4.4)  may  also  provide  a 
good  way  to  deal  with  heterogeneous  aggregated  economic 
data.  For  an  application  in  stereology,  see  Nychka  et  al. 
(1984). 

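Data involving mildly nonlinear functionals can be
accommodated - then

$y_i = N_i g + \epsilon_i$  (4.5a)

$N_i g = \int w_i(x; z; g(x; z))\, \Pi\,dx\, \Pi\,dz$.  (4.5b)

One finds $f$ and $\theta$ to minimize

$\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - N_i(f + \sum_{j=1}^{p} \theta_j \Psi_j)\bigr)^2 + \lambda J_m(f)$.  (4.6)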


The  minimization  can  be  performed  using  basis  functions 
and  a  Gauss-Newton  iteration  and  X  chosen  by  GCV  for 
nonlinear  problems,  see  O’Sullivan  and  Wahba  (1985). 


5.  Non  Gaussian  errors  (semiparametric  penalized  GLIM 
models) 


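Let

$g(x; z) = f(x) + \sum_{j} \theta_j \Psi_j(x; z)$

and let

$y_i \sim F_g$.

For example:

$y_i \sim$ Poisson with $\Lambda_i = e^{g(x(i); z(i))}$,

$y_i \sim$ Binomial with $p_i/(1 - p_i) = e^{g(x(i); z(i))}$,

etc. Here, one finds $f_\lambda, \theta_\lambda$ to minimize

$L(f, \theta) + \lambda J_m(f)$

where $L$ is the negative log likelihood. O’Sullivan (1983) and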
O’Sullivan,  Yandell,  and  Raynor  (1986)  proposed  numeri¬ 
cal  methods  and  a  GCV  for  penalized  GLIM  models.  See 
also  Green  and  Yandell  (1985),  Silverman  (1982),  Cox  and 
O’Sullivan  (October,  1985),  Leonard  (1982).  Further  work 
on  numerical  methods  for  penalized  GLIM  and  nonlinear 
indirect  sensing  problems  is  reported  in  this  proceedings 
by  Yandell. 

6.  Use  of  partial  splines  to  model  functions  which  are 
smooth  except  for  specified  discontinuities 


Let $d = 1$ and let

$g(x; z) = f(x) + \theta\,|x - x_*|$,

that is, $\Psi_1(x; z) = |x - x_*|$. Then the partial spline estimate
of $g$ will have a jump in the first derivative at $x_*$ of size
$2\theta$. In two dimensions we may use a partial spline to model
a jump in the first derivative with respect to $x_2$ along a
given curve $x_{2*}(x_1)$: Let

$\gamma(x) = \gamma(x_1, x_2) = |x_2 - x_{2*}(x_1)|$,

$g(x; z) = f(x) + \theta(x_1)\gamma(x)$

where $\theta$ may depend on $x_1$. Then

$\dfrac{\partial g}{\partial x_2}\Big|_{x_2 = x_{2*}(x_1)+} - \dfrac{\partial g}{\partial x_2}\Big|_{x_2 = x_{2*}(x_1)-} = 2\theta(x_1)$.

If, for example,

$\theta(x_1) = \sum_{j=1}^{p} \theta_j q_j(x_1)$

where the $q_j$'s are given, then

$\Psi_j(x; z) = q_j(x_1)\gamma(x)$.

This fits right into the partial spline setup, and GCVPACK
may be used to compute the estimate. A generalization to
$d = 3$ with a jump in the first derivative with respect to $x_3$
along a surface $x_{3*}(x_1, x_2)$ is straightforward. For details,
and a description of an application to the three dimensional
modelling of the tropopause in the atmosphere and
the thermocline in the ocean, see Shiau, Wahba, and Johnson
(Dec. 1985).
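
The size of the derivative jump produced by the term θ|x - x*| in the one dimensional case of this section can be checked numerically with one-sided difference quotients. In the sketch below the smooth part f(x) = x² and the values of θ and x* are arbitrary choices made only for illustration.

```python
def g(x, theta, x_star, f=lambda t: t * t):
    """g(x) = f(x) + theta * |x - x_star|, with a smooth stand-in f."""
    return f(x) + theta * abs(x - x_star)

def derivative_jump(theta, x_star, h=1e-6):
    """Right-hand minus left-hand difference quotient of g at x_star."""
    right = (g(x_star + h, theta, x_star) - g(x_star, theta, x_star)) / h
    left = (g(x_star, theta, x_star) - g(x_star - h, theta, x_star)) / h
    return right - left

jump = derivative_jump(theta=1.5, x_star=0.4)   # close to 2 * 1.5
```

The smooth part contributes the same slope from both sides, so only the absolute-value term survives in the difference, up to a discretization error of order h.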

7.  Linear  inequality  constraints 

Expressions  (2.2),  (3.2),  (4.4),  etc.  can  be  minimized 
subject  to  finite  families  of  linear  inequality  constraints. 
See  Villalobos  and  Wahba  (March  1985). 

8.  Main  effects  and  interaction  splines 

The thin plate spline is defined on Euclidean $d$ space
for any $d$ with $2m - d > 0$, provided there are enough data
points for $m$th degree polynomial regression, but unless
there are very large data sets, in many applications it will be
desirable to reduce the amount of structure involved.
Several authors have suggested modelling $f$ as a linear
combination of functions of one variable, that is,

$f(x) = f_0 + \sum_{\alpha=1}^{d} f_\alpha(x_\alpha)$,

where $x = (x_1, \ldots, x_d)$, and $\int_0^1 f_\alpha(x_\alpha)\,dx_\alpha = 0$. (Note the
switch to the unit cube.) See Friedman, Grosse, and Stuetzle
(1983), Stone (1985), Burman (June, 1985). We have
been  working  on  generalizations  of  this  idea,  whereby  /  is 
modelled  successively  as  linear  combinations  of  functions 
of  one  variable,  functions  of  one  and  two  variables,  func¬ 
tions  of  one,  two  and  three  variables,  etc.  The  resulting 
estimates  may  be  called  main  effects  splines,  first  order 
interaction  splines,  second  order  interaction  splines,  etc.,  by 
analogy  with  analysis  of  variance.  We  consider  here  two 
quite  different  but  interesting  penalty  functionals  which  we 
will  refer  to  as  TEPR  (for  "tensor  product"),  and  THPL  (for 
"thin  plate").  We  will  briefly  sketch  some  early  results  of 
some  work  in  progress,  by  describing  the  simplest  exam¬ 
ples. 

The main ideas are most easily explained by first considering
only spaces of periodic functions on the unit
$d$-dimensional hypercube, that satisfy certain linear equality
or boundary conditions, and then removing these conditions.
Let $\phi_\nu(x_\alpha) = \cos 2\pi\nu x_\alpha$ or $\sin 2\pi\nu x_\alpha$ (with some abuse
of notation), and let $\theta_0 = 1$, $\theta_\nu = 2\pi\nu$, $\nu > 0$, and let $H^m_{TEPR}$ and
$H^m_{THPL}$ be, respectively, the collections of all functions $f$ of
the form

$f(x_1, \ldots, x_d) = \sum_{\nu_1, \ldots, \nu_d = 0}^{\infty} c_{\nu_1 \cdots \nu_d}\, \phi_{\nu_1}(x_1) \cdots \phi_{\nu_d}(x_d)$  (8.1)

with

$\sum_{\nu_1, \ldots, \nu_d = 0}^{\infty} [\theta_{\nu_1} \cdots \theta_{\nu_d}]^{2m}\, c^2_{\nu_1 \cdots \nu_d} < \infty$  ($H^m_{TEPR}$)  (8.2)

or

$\sum_{\nu_1, \ldots, \nu_d = 0}^{\infty} [\theta^2_{\nu_1} + \cdots + \theta^2_{\nu_d}]^{m}\, c^2_{\nu_1 \cdots \nu_d} < \infty$  ($H^m_{THPL}$).  (8.3)

It can be shown that $H^m_{TEPR}$ will be a reproducing kernel
Hilbert space with (8.2) as squared norm for any $m > 1/2$,
and $H^m_{THPL}$ will be a reproducing kernel space with the
squared norm (8.3) for any $m > d/2$. These spaces are not
equivalent, and reflect different ideas of what is "smooth".
However, each can be written as the direct sum of $2^d$
orthogonal subspaces, namely, $H_0$, the $\binom{d}{1}$ "main effects"
subspaces of the form

$H_\alpha = \mathrm{span}\,\{\phi_{\nu_\alpha}(x_\alpha),\ \nu_\alpha = 1, 2, \ldots\}$,  $\alpha = 1, \ldots, d$,

the $\binom{d}{2}$ first order interaction spaces of the form

$H_{\alpha\beta} = \mathrm{span}\,\{\phi_{\nu_\alpha}(x_\alpha)\phi_{\nu_\beta}(x_\beta),\ \nu_\alpha, \nu_\beta > 0\}$,  $1 \le \alpha < \beta \le d$,

and so on.
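
The count of $2^d$ orthogonal subspaces follows from labelling each subspace by the subset of variables it involves. A short enumeration makes this concrete; the function name is hypothetical and subsets are represented as tuples of variable indices.

```python
from itertools import combinations

def anova_subspaces(d):
    """All subsets of {1, ..., d}; the empty subset labels H_0."""
    return [s for r in range(d + 1) for s in combinations(range(1, d + 1), r)]

subspaces = anova_subspaces(3)
main_effects = [s for s in subspaces if len(s) == 1]   # H_1, H_2, H_3
first_order = [s for s in subspaces if len(s) == 2]    # H_12, H_13, H_23
```

A model such as the one considered later in this section, with a mean, all main effects and a single interaction, corresponds to selecting a few of these subsets and discarding the rest.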

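Letting

$J_0(f) = \Bigl[\int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_d)\,dx_1 \cdots dx_d\Bigr]^2$,  (8.4)

the squared norm (8.3) on $H^m_{THPL}$ can be shown to be equal
(in $H^m_{THPL}$) to

$J_0(f) + J^m_{THPL}(f)$,  (8.5)

where

$J^m_{THPL}(f) = \sum_{\alpha_1 + \cdots + \alpha_d = m} \frac{m!}{\alpha_1! \cdots \alpha_d!} \int_0^1 \cdots \int_0^1 \Bigl(\frac{\partial^m f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}\Bigr)^2 dx_1 \cdots dx_d$  (8.6)

is the thin plate penalty functional.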

For  lack  of  space  we  will  not  discuss  the  thin  plate 
spaces  further,  but  analyses  similar  to  but  slightly  more 
complicated  than  those  below  can  be  carried  out.  In  what 
follows,  we  will  only  consider  the  tensor  product  case  and 
sub  or  superscripts  TEPR  are  to  be  understood. 


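Let

$J_\alpha(f) = \int_0^1 \Bigl[\int_0^1 \cdots \int_0^1 \frac{\partial^m f}{\partial x_\alpha^m} \prod_{\beta \ne \alpha} dx_\beta\Bigr]^2 dx_\alpha$,  (8.7a)

$J_{\alpha\beta}(f) = \int_0^1\!\int_0^1 \Bigl[\int_0^1 \cdots \int_0^1 \frac{\partial^{2m} f}{\partial x_\alpha^m\,\partial x_\beta^m} \prod_{\gamma \ne \alpha, \beta} dx_\gamma\Bigr]^2 dx_\alpha\,dx_\beta$,  (8.7b)

etc. Then the squared norm (8.2) on $H^m_{TEPR}$ can be shown to be
equal to

$J_0(f) + \sum_{\alpha=1}^{d} J_\alpha(f) + \sum_{\alpha<\beta} J_{\alpha\beta}(f) + \cdots + J_{12 \cdots d}(f)$.  (8.8)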

As an example, we will consider below $f \in H^m_{TEPR}$ which
consists only of a mean, all $d$ main effects and the first
order interaction between $x_1$ and $x_2$. Thus $f$ is of the form

$f(x_1, \ldots, x_d) = f_0 + \sum_{\alpha=1}^{d} f_\alpha(x_\alpha) + f_{12}(x_1, x_2)$,  (8.9)

where $f_0$ is a constant, $f_\alpha \in H_\alpha$, and $f_{12} \in H_{12}$. We can
now define the periodic interaction smoothing spline as that
function $f_\lambda$ of the form (8.9) which minimizes

$\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x(i))\bigr)^2 + \lambda \Bigl[\sum_{\alpha=1}^{d} J_\alpha(f_\alpha) + J_{12}(f_{12})\Bigr]$,  (8.10)

where $x(i) = (x_1(i), \ldots, x_d(i))$.

Using Lemma 5.1 in KW it can be shown that there is a
unique minimizer of (8.10) in $H_0 \oplus \sum_\alpha H_\alpha \oplus H_{12}$. An explicit
representation for it may be found using this lemma and
the fact that the reproducing kernel $K(x, z)$ for $\sum_\alpha H_\alpha \oplus H_{12}$
is given by

$K(x, z) = \sum_\alpha B_m(x_\alpha, z_\alpha) + B_m(x_1, z_1)\,B_m(x_2, z_2)$  (8.11a)

where

$B_m(s, t) = \sum_{\nu=1}^{\infty} \theta_\nu^{-2m}\,[\cos 2\pi\nu s\,\cos 2\pi\nu t + \sin 2\pi\nu s\,\sin 2\pi\nu t]$.  (8.11b)

A closed form expression for $B_m$ may be found in Craven
and Wahba (1979). GCVPACK may be used to compute
$f_{\hat\lambda}$. In principle, $\sum_\alpha J_\alpha(f_\alpha)$ can be replaced by $\sum_\alpha w_\alpha J_\alpha(f_\alpha)$,
where the $w_\alpha$ are positive weights, but problems concerning
their estimation from the data have not been studied to
date.
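
Since cos a cos b + sin a sin b = cos(a - b), the series (8.11b) depends on s and t only through s - t. For m = 1 the truncated series can be compared with the classical Fourier expansion of the Bernoulli polynomial B_2(u) = u² - u + 1/6, which satisfies Σ cos(2πνu)/ν² = π² B_2(u) on [0, 1]. The sketch below takes the series exactly as written in (8.11b); any further constant depending on how the φ_ν are normalised is left implicit in the text, so the factor 1/4 below is specific to this reading and is an assumption.

```python
import math

def b_series(s, t, m, terms=5000):
    """Truncated sum over nu of (2*pi*nu)**(-2m) * cos(2*pi*nu*(s - t))."""
    u = s - t
    return sum(
        math.cos(2 * math.pi * nu * u) / (2 * math.pi * nu) ** (2 * m)
        for nu in range(1, terms + 1)
    )

def bernoulli2(u):
    """Bernoulli polynomial B_2(u) = u**2 - u + 1/6, used for u in [0, 1]."""
    return u * u - u + 1.0 / 6.0

u = 0.3
approx = b_series(u, 0.0, m=1)    # truncated series at s - t = 0.3
closed = bernoulli2(u) / 4.0      # closed form under the stated reading
```

The rapid decay of the terms, like ν to the power -2m, is what makes such kernels cheap to evaluate either by truncation or through the closed form.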

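We will now sketch how to remove the rather restrictive
periodicity conditions from $H^m_{TEPR}$. For $g$ a function of
one variable, let

$L_0 g = \int_0^1 g(u)\,du$  (8.12a)

$L_\nu g = \int_0^1 g^{(\nu)}(u)\,du = g^{(\nu-1)}(1) - g^{(\nu-1)}(0)$,  (8.12b)

and let $L_{\nu(\alpha)} f$ mean $L_\nu$ applied to $f$ as a function of $x_\alpha$.
Then $L_{\nu(\alpha)} f = 0$ for $\nu = 0, 1, \ldots, m$, $\alpha = 1, 2, \ldots, d$, any $f$ in
$H^m_{TEPR}$. Now, it can be shown that $H_\alpha$ is that subspace of
the Sobolev space

$W_2^m[0,1] = \{g : g, g', \ldots, g^{(m-1)}\ \text{abs. cont.},\ g^{(m)} \in L_2\}$

of co-dimension $m+1$ which satisfies the $m+1$ conditions
$L_\nu g = 0$, $\nu = 0, 1, \ldots, m$. Let $k_\nu = b_\nu/\nu!$, $\nu = 0, 1, \ldots, m$, where
the $b_\nu$ are the Bernoulli polynomials; we have
$L_\nu k_\mu = 0$, $\mu \ne \nu$, $L_\nu k_\nu = 1$, $\mu, \nu = 0, 1, \ldots, m$, and thus $k_\nu$ is
not in $H_\alpha$. Let $W^0 = \mathrm{span}\,\{k_0, \ldots, k_{m-1}\}$ and let $W^1$ be
isomorphic to $H_\alpha \oplus \{k_m\}$. Then it can be shown that $W_2^m$
endowed with the inner product

$\langle g, h \rangle_{W_2^m} = \sum_{\nu=0}^{m-1} L_\nu g\,L_\nu h + \int_0^1 g^{(m)}(u)\,h^{(m)}(u)\,du$  (8.13)

satisfies

$W_2^m = W^0 \oplus W^1$.  (8.14)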


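Letting $g \in W_2^m$ with $g = g_0 + g_1$, $g_0 \in W^0$, $g_1 \in W^1$, we
can call $g_0$ the polynomial part of $g$, and $g_1$ the "smooth"
part. Now let

$H_{TEPR} = W_2^m \otimes \cdots \otimes W_2^m$  ($d$ times)  (8.15)

$= (W^0 \oplus W^1) \otimes \cdots \otimes (W^0 \oplus W^1)$

$= \Bigl(\prod_{\alpha=1}^{d} W^0_\alpha\Bigr) \oplus \Bigl(\sum_{\alpha=1}^{d} W^1_\alpha \otimes \prod_{\beta \ne \alpha} W^0_\beta\Bigr) \oplus \Bigl(\sum_{\alpha<\beta} W^1_\alpha \otimes W^1_\beta \otimes \prod_{\gamma \ne \alpha, \beta} W^0_\gamma\Bigr) \oplus \cdots$

where the Greek subscripts make explicit which variables
are involved. We can now identify the "polynomial" subspace

$H_0 = \prod_{\alpha=1}^{d} W^0_\alpha$,

the main effects subspaces

$H_\alpha = W^1_\alpha \otimes \prod_{\beta \ne \alpha} W^0_\beta$,  $\alpha = 1, \ldots, d$,

the first order interaction spaces

$H_{\alpha\beta} = W^1_\alpha \otimes W^1_\beta \otimes \prod_{\gamma \ne \alpha, \beta} W^0_\gamma$,

etc.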


The induced tensor product inner product in $H_{TEPR}$ is
a natural extension of the inner product of (8.7) and (8.8).
Letting $J_\alpha$ be the induced norm on $H_\alpha$, etc., we can now
seek $f_\lambda$ in the new, non periodic version of, for example,
$H_0 \oplus \sum_\alpha H_\alpha \oplus H_{12}$ to minimize

$\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x(i))\bigr)^2 + \lambda \Bigl(\sum_\alpha J_\alpha(f_\alpha) + J_{12}(f_{12})\Bigr)$.  (8.16)

Existence and uniqueness for any $\lambda > 0$ can be shown via
Lemma 5.1 in KW provided the design points
$x(i)$, $i = 1, \ldots, n$ are such that least squares regression in
$H_0$ is unique. The reproducing kernels for the various
subspaces then follow: the r. k.'s $R_0$ and $R_1$ for $W^0$ and $W^1$
can be shown to be

$R_0(u, v) = \sum_{\nu=0}^{m-1} k_\nu(u)\,k_\nu(v)$,

$R_1(u, v) = k_m(u)\,k_m(v) + B_m(u, v)$,

and the r. k. for $H_{TEPR}$ with the inner product induced by
(8.13) is

$\prod_{\alpha=1}^{d} \bigl(R_0(x_\alpha, z_\alpha) + R_1(x_\alpha, z_\alpha)\bigr)$;

thus, for example, the r. k. for $\sum_\alpha H_\alpha \oplus H_{12}$ is now

$Q(x, z) = \sum_\alpha R_1(x_\alpha, z_\alpha) \prod_{\beta \ne \alpha} R_0(x_\beta, z_\beta) + R_1(x_1, z_1)\,R_1(x_2, z_2) \prod_{\beta \ne 1, 2} R_0(x_\beta, z_\beta)$.

Given the r. k., an explicit representation for $f_\lambda$ can be
given, and, again, GCVPACK can be used to calculate $f_{\hat\lambda}$.
For $m = 1$, $R_0(x_\alpha, z_\alpha) = 1$, $H_0$ is one dimensional as before,
and we only replace $B_m$ in the discussion of periodic
spaces by $R_1$ and the same expressions hold. For $m > 1$, a
typical element of $H_\alpha$ with, say, $\alpha = 1$ is now of the form

$f(x_1, \ldots, x_d) = \sum_{\nu_2, \ldots, \nu_d = 0}^{m-1} f_{\nu_2 \cdots \nu_d}(x_1)\, k_{\nu_2}(x_2) \cdots k_{\nu_d}(x_d)$.  (8.18)

The $\nu_2 = \cdots = \nu_d = 0$ term depends only on $x_1$ but the
other terms do depend on the other variables albeit in a
parametric (i.e. polynomial) way. The case $m = 2$ is probably
of special interest; then $x_\beta$ with $\beta \ne \alpha$ enters at most
linearly in functions in $H_\alpha$.

There are now many interesting questions. Some of
the major ones are: the development of good methods for
choosing which interactions to include (GCV?), numerical
methods for very large data sets, methods for interpreting
the results, development of confidence intervals, and so on.
The theoretical framework of the Language of
Data  is  discussed  in  Dolby,  Clark,  and  Rogers 
(1986).  This  paper  describes  the  "language" 
half  of  the  theory  in  further  detail. 

Although we are not accustomed to thinking of
visual displays as having formal language properties,
tables and graphs follow all the rules of
written language, from the larger organizational
structure of text down to the sentence grammar of
ordinary English. In tables these properties are
reflected directly in the tabular form. To identify
their counterparts in graphs it is necessary
to look more closely at the common structure of
visual and verbal language.

The  Medium  of  Communication  for  Data 

Data  differ  from  other  forms  of  information  in 
two  respects.  With  data  the  usual  activities  of 
information  gathering,  organization,  and  synthesis 
are  carried  out  independently  by  people  who  have 
little  or  no  direct  contact  with  each  other.  As 
a  result,  what  is  ordinarily  a  simple  sequence  of 
events  is,  for  data,  a  chain  of  communication. 

The second difference is the focus of these
activities. With written information the entire
process is directed toward synthesis of the contents
into final form for dissemination to others.
With data, however, synthesis does not lead to a
single end product. Analysis is a multiple activity,
and while the communication chain does not
end with the analyst, the activities are directed
primarily toward this stage.

Both of these characteristics imply that, for
data, the issue of communication arises long
before the presentation stage. The full communication
chain runs from data collection, through
editing and revision, and storage and retrieval,
to analysis and presentation. At every interface
information is transferred from one stage to the
next through a visual intermediary; "data" is not
a spoken language. Although tables and graphs are
the primary conceptual structures for data, it is
their visible form that does the communicating.

Because the communication function of data displays
is usually treated as a presentation issue,
the problems tend to accumulate at that level.
However, some of the issues defined in Figure 1
arise as early as the data-collection stage. One
is uncertainty about whether data displays are
supposed to communicate at all, that is, whether
tables and graphs should be viewed as a communication
medium or as storage containers for data.

All tables have a capacity for communication
which is inherent in the tabular form. For example,
microdatabases and tables both have an information
structure that corresponds directly to the
organizational structure of text. A microdatabase,
however, is specifically designed for information
retrieval: the selection of individual variables,
or subsets of variables, to be related in some
context external to the database. Although any
table can be used for this purpose, a table also
has a communication structure; the limits of coherence
require that the columns of a table have
some logical relationship to each other. Thus for
tables the design decision relates to primary and
secondary functions, with the levels of access for
each use organized accordingly. Archival tables
are usually organized for access first at the
retrieval level, and next at the communication
level.

Language of Data Project, Box R, Sausalito, CA
94966. Research supported by a grant from the
System Development Foundation to San Jose State
University.

The difference in design objectives is more
apparent in graphs. One type of graph designed
explicitly for data storage is the standard
heat-transfer diagram. However, the very qualities
that make such graphs useful for their intended
purpose make them correspondingly unuseful as
communication devices. Although the most common
problem is an effort to use a graph designed for
one function for an entirely different function,
the concept that information is stored in the
data leads, at a deeper level, to a confusion of
efficient storage with efficient communication.
It also leads to the more important question of
where the information does lie in a display.

The distinction between analysis and presentation,
the second branch of the diagram, hinges on
an equally fundamental issue. The most obvious
difference is the difference in audience:
self-communication versus communication to others. For
the author of the communication, however, the two
functions are related only by the fact that one
often (but not always) follows the other. Analysis
is the process that generates the content to
be communicated; its transmission to others is
essentially a writing problem.

FIGURE 1. Basic functions of data displays.
From Clark, Statistical Presentation—of What, to
Whom, and for Which Purpose (1983)

Of  course,  analysis  also  takes  place  on  the 
reader's  side  of  the  page,  often  so  immediately 
that  the  transition  from  reader  to  analyst  is  too 
fleeting  to  grasp.  As  a  result,  the  distinction 
between  comprehension  and  analysis  is  similarly 
blurred,  and  the  information  appears  to  communi¬ 
cate  itself  without  the  aid  of  any  vehicle. 

Although  the  further  distinctions  at  the  bottom 
of  Figure  1  relate  to  presentation,  some  of  them 
might  also  apply  in  other  branches  of  the  diagram. 
(The  three  branches  of  graphic  design  are  described 
in  Clark,  1983).  For  example,  the  three  purposes 
of  presentation  narrow  the  communication  goal  at 
all  preceding  stages  to  the  communication  of  infor¬ 
mation,  and  this  is  in  fact  the  goal  to  which  the 
Language  of  Data  is  limited. 

Some  of  the  methodology  of  information  graphics 
may  also  apply  to  the  design  of  analytic  tools. 

For  example,  perceptual  persistence  is  utilized  in 
textbook  design  to  relegate  successive  levels  of 
information  to  context  as  the  ideas  are  absorbed, 
a  process  not  unlike  that  in  analysis,  where  dis¬ 
covery  results  from  the  accumulation  of  insight 
gained  through  successive  views  of  the  data. 

Unfortunately  the  advantage  of  perceptual  per¬ 
sistence  during  analysis  backfires  at  the  communi¬ 
cation  stage.  We  all  have  trouble  seeing  our  own 
material  from  a  state  of  innocence,  but  visual 
representations  are  particularly  susceptible  to 
the  "Eureka  problem."  Once  a  particular  meaning 
has  been  discovered,  it  seems  to  leap  out  of  the 
page  in  almost  any  view  of  the  data.  As  a  result, 
the  author  of  a  visual  display  tends  to  read  into 
it  what  he  or  she  meant  to  show  and  assume  that 
the  table  or  graph  actually  communicates  this 
information. 

What  the  reader  sees,  of  course,  is  the  meaning 
(if  any)  to  which  he  or  she  is  visually  directed 
by  the  table  or  graph.  Since  this  last  condition 
also  holds  for  analytic  tools,  it  is  worth  looking 
more  closely  at  where  the  meaning  lies  in  a  data 
display. 

The  Relationship  of  Form  and  Content 

The  information  value  of  a  table  or  graph 
depends on the utility of the information it
contains.  Communication  value,  however,  is  the 
extent  to  which  the  form  and  content  are  a  one- 
to-one  match.  This  definition  immediately  implies 
the  need  for  a  formal  separation  of  form  and  con¬ 
tent,  and  as  a  next  step,  a  definition  of  meaning 
at  the  content  level. 

Dolby's  definition  of  the  two  components  of 
information  as  the  data  and  the  operations  on  data 
involves  a  distinction  so  basic  that  it  is  rarely 
articulated  (Dolby  and  Clark,  1982).  However,  it 
extends  immediately  to  the  corresponding  compon¬ 
ents  of  language,  the  words  and  their  syntactic 
relationships  in  a  sentence. 

The  meaning  of  a  sentence,  however,  lies  in 
what  the  sentence  says.  Thus  a  third  component 
of  information  is  the  intended  meaning — the  net 
result  of  a  particular  set  of  elements  and  their 
operational  relationships. 

These three components of information can be
described in generic terms as the data elements,
the set of operational relationships, and their
resultant meanings, limited to the set of possible
meanings in data displays:

    Data elements + operational relationships = meaning

The  data  elements  consist  of  all  the  elements  of  a 
complete  datum,  with  the  meaning  of  each  descrip¬ 
tor  cleanly  defined  by  the  classification  scheme 
discussed  in  Dolby,  Clark,  and  Rogers  (1986). 

These  elements,  including  the  values,  constitute 
the  "words"  in  a  data  sentence.  Since  the  rela¬ 
tional  statements  in  a  table  or  graph  are  limited 
by  the  allowable  operations  on  data,  the  set  of 
meanings  in  the  Language  of  Data  is  limited  to 
statistical  meanings. 

Because  the  results  of  comparisons  are  often 
supplied  by  mental  arithmetic,  the  derived  data 
are  commonly  thought  of  as  the  information  in  a 
display.  The  information,  however,  lies  in  the 
entire  relational  statement,  which  may  or  may  not 
include the derived datum. For example, in a
statement of equality, a + b = c, the term c has
information value only with respect to its
components, or as one of the components of a
different relational statement at the next level of
derivation.

The  corresponding  formulations  for  data  dis¬ 
plays  refer  to  the  visible  counterparts  of  the 
content  elements.  In  tables,  as  in  verbal  lan¬ 
guage,  the  syntactic  relationships  of  the  terms 
are  implicit  in  the  terms  being  related.  For  ex¬ 
ample,  the  form  of  a  word  usually  specifies  its 
functional  role  in  a  sentence.  In  a  table  these 
roles  are  denoted  instead  by  spatial  relation¬ 
ships.  In  tables  the  verbal  elements  carry  the 
burden  of  communication: 

    Verbal elements + implicit relationships = implicit meaning

In  visual  language  the  situation  is  reversed. 
The  most  visible  component  in  a  graph  is  the  set 
of  relationships: 

    Visual elements + visible relationships = visible meaning

This formulation is a decomposition of a set of
properties which are usually defined as a single
system. The properties themselves are discussed
at length in Bertin's Semiologie Graphique (1983),
a detailed application of Kandinsky's classic
Point and Line to Plane (1926) to statistical
data.

As  Bertin  points  out,  the  properties  of  the 
graphic  system  are  independent  of  content.  In  the 
communication  of  information,  however,  the  objec¬ 
tive  is  to  relate  them.  One  of  the  obstacles  has 
been  a  mismatch  in  definitions.  Whereas  the  en¬ 
tire  graphic  system  is  commonly  thought  of  as 
visual  syntax,  the  set  of  relationships  at  the 
content  level  is  defined  primarily  in  terms  of 
the  data  component. 

Although  an  exact  match  requires  better  articu¬ 
lation  of  the  set  of  relationships  in  statistics, 
the  three  formulations  above  provide  a  framework 
for  defining  the  properties  of  data  displays 
directly  in  terms  of  the  content  elements  they 
express. 




The  Language  Properties  of  Tables 

Although  the  descriptive  information  in  tables 
is  usually  limited  to  the  elements  needed  to 
identify  the  contents,  a  table  is  essentially  a 
highly  condensed  form  of  text.  In  fact,  when  the 
descriptive  elements  are  classified  under  Dolby's 
formalism  for  a  complete  datum,  the  resulting 
descriptor  set  is  an  even  more  condensed  form  of 
text  (see  Dolby,  Clark,  and  Rogers,  1986). 

Most  of  this  information  is  given  in  the  table 
title,  which  generally  identifies  the  universe  of 
discourse.  In  Figure  2,  for  example,  the  title 
covers  four  of  the  five  dimensions  of  descrip¬ 
tion.  The  space  is  U.S.,  the  function  discussed 
is  consumption,  and  the  matter  consumed  is  energy. 
The  observer  is  rarely  specified  in  derived  data, 
but  in  this  case  the  observers  (EIA's  respondents) 
were  the  end  users. 

There  is  an  important  distinction  in  classifi¬ 
cation  between  the  information  specified  in  the 
description  and  that  supplied  from  external 
knowledge  about  the  data.  The  aspect  descriptor, 
for  example,  does  not  require  an  inference;  the 
BTU  is  formally  defined  as  the  unit  of  heat  con¬ 
tent.  Heat  content,  of  course,  is  merely  the 
aspect  discussed  in  this  representation  of  the 
data,  not  the  aspects  (various)  represented  by 
the  primary  data. 

The  same  words  that  describe  the  contents  of 
the table, however, also have another function.
At the data level they become the words in a set of
highly  structured  sentences  all  dealing  with  U.S. 
consumption  of  energy.  In  short,  we  are  looking 
at  an  ordinary  paragraph — a  series  of  sentences 
all  of  which  are  expansions  on  a  single  topic. 

The  subject  of  the  table  is  the  variable 
arrayed  in  the  stub — in  Figure  2,  the  end-use 
sectors.  The  topic  of  the  discussion  is  energy 
consumption, and the specific topic is the amounts
consumed, the set of values in the table field.
The  column  heads  comprise  the  third  discourse 
variable,  the  "statement  variable."  As  a  set, 
they  specify  the  nature  of  the  statements  made  in 
the  table,  in  this  case  the  change  in  consumption 
over  time.  In  other  words,  the  stub  describes 
what  the  table  is  about,  and  the  column  heads 
describe  what  the  table  says. 


FIGURE 2 The descriptive elements of a table

U.S. CONSUMPTION OF ENERGY, BY END-USE SECTOR
AND YEAR; 1977-1980

Quadrillion BTU's

End-use sector                1978      1979      1980

Residential & commercial    28.159    27.462    28.283
Industrial                  29.373    31.551    30.284
Transportation              20.612    19.950    18.623

Source: James L. Dolby, Data Analysis: Tables In,
Tables Out, Online '84, 1984. Data from Monthly
Energy Review, June 1981, p. 18.


In  a  formal  table  all  three  discourse  variables 
are  named  in  the  table  title.  The  topic  component 
of  the  title,  which  states  the  topic  of  the  table, 
refers  directly  to  the  elements  in  the  table  field. 
The  next  two  components  usually  appear  in  the  par¬ 
tition  rules.  The  subject  component  names  the 
variable  listed  in  the  stub,  and  the  statement 
component  names  the  category  in  the  column  heads — 
generally  in  that  order.  Thus  a  properly  con¬ 
structed  table  title  provides  a  useful  structural 
description of the table for retrieval. Data are
stored (and later, used) by classification
category; they are retrieved by discourse category.
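
The correspondence between the three discourse variables and the parts of a table can be sketched in a few lines of Python. This is only an illustration using the contents of Figure 2; the class name, field names, and the two helper methods are assumptions introduced here, not part of the Language of Data formalism.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """A table described by its discourse variables (hypothetical layout)."""
    space: str               # part of the universe of discourse
    topic: str               # what the values in the field measure
    subject: str             # the variable arrayed in the stub
    statement: str           # the category named by the column heads
    stub: list[str]
    column_heads: list[str]
    field: list[list[float]]

    def title(self) -> str:
        # Topic first, then subject and statement components, in that order.
        return f"{self.space} {self.topic}, by {self.subject} and {self.statement}"

    def retrieval_key(self) -> tuple[str, str, str]:
        # Retrieval is by discourse category, not by classification category.
        return (self.topic, self.subject, self.statement)


figure_2 = Table(
    space="U.S.",
    topic="consumption of energy",
    subject="end-use sector",
    statement="year",
    stub=["Residential & commercial", "Industrial", "Transportation"],
    column_heads=["1978", "1979", "1980"],
    field=[
        [28.159, 27.462, 28.283],
        [29.373, 31.551, 30.284],
        [20.612, 19.950, 18.623],
    ],
)

print(figure_2.title())
print(figure_2.retrieval_key())
```

Because the title is assembled from the same three names that label the field, the stub, and the column heads, a properly constructed title doubles as a structural description of the table.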

Although  most  dictionaries  give  only  circular 
definitions  of  subject  and  topic,  the  difference 
between  them  shows  up  in  their  higher-level  struc¬ 
tures.  The  chief  characteristic  of  the  subject 
is  unity,  whereas  coherence  refers  to  the  topical 
progression.  Thus  at  the  publication  level  the 
subject  is  a  constant  across  the  publication;  top¬ 
ics,  however,  come  in  sets  by  definition.  Once 
the  subject  has  been  partitioned  off,  the  contents 
are  organized  into  some  logical  topical  sequence, 
with  the  discussion  under  each  main  topic  parti¬ 
tioned  into  topical  subsets,  which  also  have  a 
progressive  relationship. 

The  same  properties  show  up  in  tables  as  a 
single  variable  in  the  stub  and  a  set  of  variables 
in  the  columns,  often  grouped  into  subsets  by 
spanner  heads.  One  of  the  requirements  for  the 
stub  is  that  the  subject  elements  be  an  aggregat- 
able  set.  In  contrast,  the  columns  may  be  related 
by  any  signs  of  operation;  the  only  requirement  is 
that  they  all  belong  to  some  category  that  can  be 
named.  In  some  cases  this  higher-level  node  may 
be  fairly  high  up  in  the  topic  structure.  The 
topic  structure  itself  is  an  aggregation  tree,  of 
course.  However,  the  fundamental  operations  on 
data  apply  not  only  to  nominal  variables,  but  to 
the  domain  of  visual  variables  in  graphs. 
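
The requirement that the stub be an aggregatable set can be illustrated with the values of Figure 2. The sketch below sums the stub entries within each column; it assumes, for illustration only, that the three end-use sectors are exhaustive, so that each column sum can be read as a total for that year.

```python
stub = ["Residential & commercial", "Industrial", "Transportation"]
column_heads = ["1978", "1979", "1980"]
field = [
    [28.159, 27.462, 28.283],
    [29.373, 31.551, 30.284],
    [20.612, 19.950, 18.623],
]

# Aggregate the subject elements: one sum per column (quadrillion BTU).
totals = [round(sum(column), 3) for column in zip(*field)]

for head, total in zip(column_heads, totals):
    print(f"All sectors, {head}: {total:.3f}")
```

The column heads, by contrast, are not summed here: adding 1978 to 1979 would not yield a meaningful datum, which is why the columns need only belong to a nameable category.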

The relationship between visual and verbal
language is easiest to see in tables. A table,
like  any  other  visual  representation,  exists  at 
the  most  fundamental  level  as  a  set  of  visual 
events  which  are  perceived  and  organized  by  the 
eye  in  successive  stages  of  resolution.  The 
process  is  similar  to  the  effect  of  decreasing 
distance.  For  example,  if  a  page  of  text  is  held 
far  enough  away,  the  only  discernible  form  is  the 
page.  As  the  distance  is  decreased  a  printed 
image  becomes  visible,  first  as  a  gray  area  and 
then  as  a  pattern  of  uniform  lines.  Well  beyond 
reading  range,  the  lines  become  recognizable  as 
strings  of  words,  and  eventually  enough  letter- 
forms  become  discernible  for  the  words  to  be  read. 
For  material  at  close  range,  the  corresponding 
progression  from  peripheral  vision  to  attention 
and  perception  is  merely  an  instantaneous  change 
of  focus. 

With  tables,  then,  there  is  quite  a  lot  that 
goes  on  before  the  reader  gets  to  the  data.  The 
first  perceptual  task  is  bounding  the  set  of 
events  to  be  organized,  and  the  next  is  locating 
the  visual  characteristics  that  identify  the  image 
as  a  table.  With  some  tables  this  is  a  nontrivial 
problem,  but  even  beyond  reading  range  most  tables 
have  two  identifying  features:  they  contain  a 
matrix  area,  with  a  discernible  pattern  of  rows 
and  columns,  and  they  consist  of  material  that  is 
to  be  read. 

(a) The scanning stage
(b) Reading and primary comparisons
(c) Second-level comparisons

FIGURE 3 Visual syntax in tables
From Dolby and Clark, The Language of Data (1982)


The  second  characteristic  provides  an  important 
cognitive  cue:  recognition  that  material  is  to  be 
read  immediately  directs  the  viewer  to  the  start¬ 
ing  point  for  reading.  In  Western  cultures  this 
is  the  upper  left  corner  of  the  image — usually  the 
beginning  of  the  table  title. 

From  this  starting  point  the  rest  of  the  table 
is  scanned  to  locate  and  identify  its  structural 
members.  This  search  for  visual  structure  takes 
place  with  all  visual  displays,  but  the  scanning 
pattern  for  tables  follows  a  specific  sequence. 

With  all  written  materials  scanning  tends  to  be 
vertical,  in  the  sense  that  it  starts  at  the  top 
and  progresses  downward.  Reading  is  horizontal, 
however,  so  that  the  scan  usually  consists  of  a 
horizontal  sweep  of  the  column  heads,  followed  by 
a  vertical  sweep  of  the  stub. 

This  sequence  of  directional  responses  narrows 
the  scan  down  to  the  critical  area — the  values  in 
the  table  field.  Whereas  the  preceding  responses 
are  universal,  the  processes  that  go  on  in  this 
part  of  the  table  vary  widely  with  the  reader's 
experience,  interests,  and  ability  to  handle  num¬ 
bers. 

Experienced  data  users  will  usually  look  at  the 
field  as  a  whole  to  determine  the  range  of  values, 
and  then  scan  horizontally,  vertically,  and  perhaps 
along  the  diagonal  axis  for  any  obvious  patterns 
of  increase  or  decrease.  The  general  reader  may 
scan  the  table  field  only  for  obvious  exceptions 
to  uniformity,  or  simply  begin  reading  across  each 
row  as  soon  as  the  column  heads  and  stub  have  been 
identified. 

At  both  ends  of  the  experience  scale,  however, 
before  the  data  can  be  analyzed  they  have  to  be 
read — and  even  numbers  are  read  from  left  to  right. 
As  a  result,  it  is  the  directional  preference  for 
reading  that  determines  the  order  in  which  the 
comparisons  are  seen  in  a  table.  For  example, 
both  tables  in  Figure  4  contain  the  same  data. 


In  the  first  table  the  most  immediate  observation 
is  that  the  prison  population  increased  consid¬ 
erably  from  1976  to  1980;  in  fact,  the  female 
population  almost  doubled.  In  the  second  table 
the  first  observation  is  that  the  ratio  of  males 
to females is more than 20 to 1. Although both
conclusions  can  be  drawn  from  either  table,  the 
orientation  of  the  table  matrix  determines  the 
order  in  which  this  information  becomes  visible. 

The reading stage in Figure 3b reduces each row
of the table to a result, a new datum at the next
level of derivation, so that the set of results
can be compared, this time in a vertical column.
Thus, if we stick to the primary comparisons in
the table, there is another right-angle change
which now brings the result column into focus.
Although data are compared horizontally, numbers
are easier to compare in a column. Hence taking
the primary comparisons across the table now puts
the next level of comparison into the most
advantageous position for the higher-level operations
of statistics.

FIGURE 4 Effect of orientation of the table
matrix on meaning

California state prison population:

              1976      1980
Males       15,891    20,608
Females        590     1,039

             Males   Females
1976        15,891       590
1980        20,608     1,039

Data from California Prisoners, California
Department of Corrections, 1979 and 1983.

Because  the  last  stage  in  Figure  3  is  often  the 
starting  point  for  the  analyst,  it  is  easy  to  miss 
the  language  structure  at  the  preceding  stages. 
However,  the  visual  sorting  that  goes  on  at  the 
scanning  stage  merely  establishes  the  discourse 
structure  of  the  table,  the  framework  for  communi¬ 
cation.  Communication  itself  takes  place  at  the 
reading  level.  This  means  that  the  elements  being 
read  have  to  have  some  syntactic  relationship  that 
results  in  an  intelligible  statement. 

The  most  immediately  available  syntax  in  a 
table  lies  in  the  spatial  relationship  of  its 
parts.  Unlike  prose,  in  tables  each  structural 
element  partitions  directly  into  its  grammatical 
counterpart  at  the  sentence  level.  As  a  result, 
all  table  sentences  have  the  same  basic  form.  The 
subject  is  the  item  listed  in  the  stub,  and  each 
column  entry  is  a  statement  of  quantity  about  that 
subject.  Thus,  across  the  whole  row,  the  sentence 
is  a  simple  sentence  with  a  compound  predicate. 

In  the  first  table  in  Figure  4,  for  example, 
the  first  row  might  be  expressed  in  telegram  style 
as 

(California  prison  population  for) 

Males:  15,891  in  1976  and  20,608  in  1980 
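
Since every table sentence has the same basic form, the telegram-style reading can be generated mechanically. The following Python sketch does so for the first table in Figure 4; the function name, its parameters, and the choice of "in" as the governing preposition are assumptions made for this illustration.

```python
def table_sentences(context, stub, column_heads, field, preposition="in"):
    """Render each row as a simple sentence with a compound predicate."""
    sentences = []
    for subject, values in zip(stub, field):
        # Each column entry is one statement of quantity about the subject.
        predicate = " and ".join(
            f"{value:,} {preposition} {head}"
            for value, head in zip(values, column_heads)
        )
        sentences.append(f"({context} for) {subject}: {predicate}")
    return sentences


sentences = table_sentences(
    "California prison population",
    stub=["Males", "Females"],
    column_heads=["1976", "1980"],
    field=[[15891, 20608], [590, 1039]],
)

for sentence in sentences:
    print(sentence)
```

The stub entry supplies the subject, the column heads supply the governing phrases, and the values supply the predicate, so no information beyond the table itself is needed to write the sentence.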

Unlike  prose,  the  subject  of  a  table  sentence  cor¬ 
responds  directly  to  the  subject  of  the  discussion. 
The only verb in a table sentence is "is" or "was" (or,
for predictions, "will be"); hence the values in the
columns form the main predicate, and the column
heads function as governing clauses. Although the
device holds unpleasant memories for many, these
relationships are easiest to see in a sentence
diagram:

                   (in) 1976    (in) 1980

    Males (was)      15,891       20,608



The point of the statement, however, is the
comparison of the two populations: about 16,000 in
1976 compared to about 20,500 in 1980. That
substitution of "compared to" for "and" is the link
between statistics and simple prose. It is also the
source of meaning in a data display. All the statements in a
table  or  graph  are  statements  of  comparison — 
statements  of  additivity,  proportionality,  and 
so  on.  The  meaning  in  any  particular  instance 
depends  on  which  of  these  relationships  is  speci¬ 
fied. 

For  example,  the  transpose  in  Figure  4  changed 
the  subject  (but  not  the  topic)  of  the  table.  In 
the  first  table  the  subject  was  gender,  and  in  the 
second  table  it  was  the  year.  But  it  is  the  state¬ 
ments  about  the  subject  that  are  of  interest,  so 
a  more  important  consideration  is  that  the  two 
tables  have  different  predicates.  As  a  result, 
they  convey  different  information,  despite  the 
fact  that  they  contain  exactly  the  same  data. 

Notice  that  the  comparisons  in  Fig.  4  were  at 
the  variable  level,  not  the  data  level — a  compar¬ 
ison  of  whole  columns  across  the  table.  In  the 
first  table  the  primary  comparison  was  a  differ¬ 


ence,  and  in  the  second  table  it  was  a  ratio.  The 
only  place  in  the  table  these  column  relationships 
are  expressed  is  the  column  heads — the  statement 
variable.  It  is  the  relationships  stated  in  the 
column  heads  of  a  table  that  are  shown  in  the 
field  of  a  graph. 
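
The effect of the transpose in Figure 4 can be worked through numerically. The sketch below reads each table across its rows, taking the primary comparison as a difference in the first orientation and as a ratio in the second; the helper function and variable names are assumptions introduced for this example.

```python
def transpose(stub, column_heads, field):
    """Swap the subject (stub) and statement (column heads) variables."""
    new_field = [list(column) for column in zip(*field)]
    return column_heads, stub, new_field


stub = ["Males", "Females"]
column_heads = ["1976", "1980"]
field = [[15891, 20608], [590, 1039]]

# First table: subject is gender; reading across a row compares years.
changes = {s: row[1] - row[0] for s, row in zip(stub, field)}
growth = {s: row[1] / row[0] for s, row in zip(stub, field)}

# Second table: subject is year; reading across a row compares genders.
t_stub, t_heads, t_field = transpose(stub, column_heads, field)
ratios = {year: row[0] / row[1] for year, row in zip(t_stub, t_field)}

print(changes)
print({s: round(g, 2) for s, g in growth.items()})
print({year: round(r, 1) for year, r in ratios.items()})
```

Read this way, the first table yields increases of 4,717 males and 449 females (the female population growing by a factor of about 1.76), while the second yields male-to-female ratios of roughly 27 to 1 in 1976 and 20 to 1 in 1980. The data are identical; only the predicate has changed.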

Statements  of  Comparison  in  Graphs 

Although  the  properties  of  visual  language  are 
usually  defined  as  a  system,  they  partition  easily 
into the two components of a relational statement,
the set of terms and the set of relationships.
In tables the focus is on the terms in
the  equation,  and  in  graphs  it  is  on  the  signs  of 
operation. 

The  distinction  between  the  discourse  structure 
of  text  and  communicative  syntax,  which  are  not 
usually  thought  of  as  connected  in  prose,  provides 
us  with  a  second  partition.  The  visual  structure 
of  an  image,  as  opposed  to  the  syntactic  relation¬ 
ships  that  convey  meaning,  is  the  direct  counter¬ 
part  of  the  discourse  structure  of  text.  In  a 
table,  for  example,  the  process  of  visual  sorting 
that  takes  place  at  the  scanning  stage  identifies 
the  subject,  topic,  and  statement  categories — 
which  also  corresponds  (or  should  correspond)  to 
the  information  structure  of  the  contents. 

Like  a  table,  a  graph  also  consists  of  a  series 
of  sentences  all  of  which  are  an  expansion  on  a 
single  topic.  In  graphs,  however,  the  starting 
point  for  reading  is  not  the  upper  left-hand  cor¬ 
ner  of  the  image;  it  is  the  most  prominent  visual 
element.  The  first  sentence  in  the  graph  is  the 
relationship  of  this  element  to  the  next  most 
prominent  element,  where  the  prominence  hinges  on 
common  aspect.  Although  a  number  of  factors  are 
involved,  the  levels  of  subordination  in  a  graph 
are  essentially  a  change  in  common  aspect  at  each 
level. 

The  role  of  aspect  as  a  structural  link  in  data 
classification  is  discussed  in  Dolby,  Clark,  and 
Rogers  (1986).  It  applies  here  in  the  same  sense, 
as  both  the  name  of  the  variable  and  the  specific 
topic.  The  three  aspects  of  color — hue,  chroma, 
and  value  (light/dark) — are  well  known.  However, 
visual  elements  also  have  shape,  size,  orienta¬ 
tion,  and  so  on.  The  relational  elements  all 
have  direction  as  well  as  extent,  and  a  common 
direction  may  be  the  common  aspect. 

In  a  table,  for  example,  the  aspect  descriptor 
links  the  topic  elements  in  the  table  field  to  the 
common  topic  at  that  level,  the  topic  term  in  the 
title.  Where  the  variables  differ  in  aspect,  the 
common  aspect  is  their  relationship,  for  example, 
Boyle's  law  in  the  case  of  pressure,  temperature, 
and  volume.  The  higher-level  structure  of  aspect 
is  independent  of  the  topic  structure;  however,  it 
is  often  of  particular  interest  in  exploring  data. 

Aspect also serves as the pivot term between
variables. In a table it functions as a hinge
between the subject variable in the stub
and the common topic: the difference between
"temperatures in patients" and "temperature in malaria."
As in verbal language, the change in antecedents
represents a transfer of attention from one object
to another. As a result, although the statements
communicated by a graph are not limited to
elementary comparisons, they can all be decomposed
into a sequence of simple comparisons. All visual
effects stem from a comparison with some visible
or imaginary reference. At the "reading stage" in
graphs this reference shifts, with the endpoint of
one sentence becoming the starting point for the next.
The  meaning,  however,  lies  in  the  resultant.  In 
general,  graphs  follow  the  concepts  of  vector 
algebra  rather  than  linear  algebra. 

In  graphs  as  well  as  tables,  however,  the  basic 
reference  level  is  the  horizontal  axis.  For  example, 
if the image in Figure 3b is visualized as a bar
chart, the first information it shows is the
lengths. If the page is turned 90 degrees, to
make the image a column chart, it now shows
comparisons of lengths, again read across the graph.
As with the two tables in Figure 4, both pieces of
information  can  be  extracted  from  either  graph, 
but  the  orientation  of  the  graph  determines  the 
order  in  which  this  information  becomes  visible. 

All  graphs,  in  fact,  show  both  levels  of  deri¬ 
vation.  In  a  graph,  however,  the  order  can  run  in 
either  direction — as  a  forward  projection  to  the 
next  level,  with  the  focus  on  the  relationships  at 
the first level, or as a backward projection to the
components,  with  the  focus  on  the  derived  level. 

As  a  result  graphs  are  capable  of  much  finer 
degrees  of  meaning  than  tables,  but  they  also 
require  a  decision  that  takes  care  of  itself  in 
a  table.  In  a  graph  the  focus  has  to  be  on  one 
level  of  derivation  or  the  other,  since  the  dis¬ 
cussion  cannot  progress  simultaneously  in  both 
directions. 

Because  the  data  elements  are  the  carriers  of 
meaning  in  tables,  the  column  heads  of  a  table  are, 
in  effect,  an  equation  with  elliptical  signs  of 
operation.  These  are  the  relationships  stated  in 
graphs;  hence  in  a  graph  they  have  to  be  specified. 
For example, the graph in Figure 5a shows crude oil
imports  in  relation  to  total  imports.  In  opera¬ 
tional  terms  the  statement  is  a  simple  proportion. 
The  primary  comparisons  are  the  change  over  time, 
so  the  proportion  itself  is  a  second-level  com¬ 
parison,  a  comparison  of  the  structure  of  the 
two  horizontal  comparisons.* 

It  is  easy  enough  to  deduce  from  the  two  curves 
that  crude  oil  imports  accounted  for  most  of  the 
fluctuation  in  the  total.  But  suppose  the  whole 
point  of  the  discussion  is  the  fluctuation  of 
crude  oil  in  relation  to  the  stability  of  other 
commodities.  Fluctuation  and  stability  are  two 
different  aspects  of  the  data,  so  the  relation¬ 
ship  in  this  case  is  a  ratio,  not  a  proportion. 

The  graph  in  Figure  5b  shows  exactly  the  same  data, 
but  instead  of  a  component  relationship,  the  two 
separate  curves  enable  the  reader  to  see  the 
ratio  of  fluctuation  to  stability. 

Again,  the  secondary  information  in  the  graph 
can  be  deduced;  total  imports  is  simply  the  sum  of 
the  two  quantities.  Notice,  however,  that  this 
deduction  involves  mental  arithmetic,  not  percep¬ 
tion;  the  two  fall  in  different  domains.  In  fact, 
although  the  graph  shows  the  ratio  of  two  quanti¬ 
ties,  the  primary  visual  comparison  is  not  the 
quantitative  ratio.  It  is  the  ratio  of  fluctua¬ 
tion  to  stability. 
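
The difference between the two readings of Figure 5 can be sketched numerically. The monthly figures below are invented for illustration and are not the data plotted in Figure 5; likewise, using the range of a series as a stand-in for its "amount of variation" is an assumption of this sketch, since the comparison in the graph itself is perceptual rather than computed.

```python
# Invented monthly volumes in million barrels/day, for illustration only.
crude = [6.4, 6.1, 6.6, 5.9, 6.3]
other = [2.0, 2.1, 2.0, 2.1, 2.0]

# Secondary information: the total is deduced by arithmetic, not perceived.
total = [round(c + o, 1) for c, o in zip(crude, other)]

# Figure 5a reading: crude oil as a component (proportion) of the total.
share = [round(c / t, 2) for c, t in zip(crude, total)]


def amount_of_variation(series):
    """Range of a series; an assumed stand-in for 'amount of variation'."""
    return max(series) - min(series)


# Figure 5b reading: fluctuation of one curve against stability of the other.
fluctuation_ratio = amount_of_variation(crude) / amount_of_variation(other)

print(total)
print(share)
print(round(fluctuation_ratio, 1))
```

The share answers a question about amounts, while the ratio of the two ranges answers a question about behavior over time. The same two series support both statements, but each statement calls for a different display.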

If the point were the amounts, the most appropriate
form would be one that focuses on this
aspect of the data instead, a table. There are
two factors involved. One is the focus on the
particular aspect of the data most efficiently
communicated by numerical symbols. The other
involves screening out factors that distract from
that focus, and in particular those elements that
show something else.

*At the time of his death Dolby was working on
a statistical basis for the comparison of comparison
structures.

[Graphs not reproduced. Vertical axis: average daily volume, million barrels/day.]

(a) A proportional relationship of two quantities
(b) A ratio of two properties: fluctuation and stability

Includes imports of crude oil to the Strategic Petroleum Reserve. Data from the Monthly
Petroleum Statistics Report, January 1980.

FIGURE 5 Operational relationships in graphs
From Clark, Sample Pages and Specifications:
Monthly Petroleum Statement (1981)

Current source: Clark; 1981
Original source: EIA; 1979-1980

Observer: Petroleum importing companies
Matter:   Crude oil
Function: Imports
Space:    U.S.
Time:     12/78 to 12/79

Aspect: Volume, mbbl [-*- Imports (F)]
Domain: 9.2, 9.0, . . . , 8.4

Aspect: Fluctuation [-*- Imports (F)]
Domain: Pattern of variation
Domain: Amount of variation
Domain: Variation with respect to reference

FIGURE 6 Possible descriptor set for Figure 5


The  visual  variable  that  makes  fluctuation  and 
stability  explicit  in  Figure  5b  is  shape;  thus  in 
another  view  of  the  data  this  variable  will  be  the 
chief  source  of  noise.  Removing  it  altogether 
implies  a  table — unless,  of  course,  the  focus  is 
on  some  other  aspect  of  the  data,  expressed  by  a 
different  visual  variable. 

The  descriptor  sets  for  data,  discussed  in 
Dolby,  Clark,  and  Rogers  (1986),  have  especially 
important  implications  for  graphs.  Whereas  the 
data  component  is  the  visible  information  in  a 
table,  in  graphs  the  entire  descriptive  structure 
has  to  be  supplied  through  another  domain  (as  with 
the  labels  on  graphs).  In  an  interactive  system 
this  information  would  have  to  be  available  re¬ 
gardless  of  the  mode  of  representation. 

Descriptor  sets  play  a  more  immediate  role,  as 
a  mechanism  for  specifying  the  meaning  to  be  com¬ 
municated  by  a  table  or  graph.  The  form  developed 
for the classification of individual variables in
a microdatabase is designed as an independent mode of
representation which summarizes the information
structure of the variable as well as its contents.
The variables in a graph, however, require more
precise specification of the intended meaning.
In  Figure  6,  for  example,  the  aspect  descriptor 
links  structures  in  more  than  one  domain  of 
representation. 
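
A descriptor set of the kind shown in Figure 6 might be held in a program as a small record. The sketch below is one possible layout, not the form used in the Language of Data project; the class and method names are assumptions, and the elided volume values are kept as the figure gives them rather than filled in.

```python
from dataclasses import dataclass, field


@dataclass
class Aspect:
    """One aspect of the data and the domains in which it is represented."""
    name: str
    domains: list[str] = field(default_factory=list)


@dataclass
class DescriptorSet:
    """The five dimensions of description plus the aspects of a variable."""
    observer: str
    matter: str
    function: str
    space: str
    time: str
    aspects: list[Aspect] = field(default_factory=list)

    def domains_for(self, aspect_name: str) -> list[str]:
        # Return every domain linked to the named aspect, or an empty list.
        for aspect in self.aspects:
            if aspect.name == aspect_name:
                return aspect.domains
        return []


figure_6 = DescriptorSet(
    observer="Petroleum importing companies",
    matter="Crude oil",
    function="Imports",
    space="U.S.",
    time="12/78 to 12/79",
    aspects=[
        Aspect("Volume, mbbl", ["9.2, 9.0, ..., 8.4"]),  # values elided in source
        Aspect("Fluctuation", [
            "Pattern of variation",
            "Amount of variation",
            "Variation with respect to reference",
        ]),
    ],
)

print(figure_6.domains_for("Fluctuation"))
```

Keeping several domains under a single aspect reflects the point made above: the same aspect descriptor can link structures in more than one domain of representation.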

Conclusion 


Although  the  issue  is  usually  posed  as  "tables 
versus  graphs,"  for  interactive  use  the  answer  may 
be  both,  depending  on  the  analytic  step.  There 
are  a  number  of  situations  in  which  the  analyst 
may  want  to  switch  back  and  forth  from  one  mode  of 
representation  to  the  other,  either  to  focus  on  a 
particular  aspect  of  the  data  or  to  move  to  a  com¬ 
putational  step.  A  table  manipulator  designed 
specifically  for  interactive  use  is  discussed  in 
Rogers  (1986). 

Once  the  applicability  conditions  for  relating 
two  variables  have  been  defined,  the  descriptor 
sets  will  have  a  communication  structure  as  well. 
Specification  of  the  form  of  display  will  then  be 
a  matter  of  matching  the  aspect  of  the  data  the 
analyst  wants  to  see,  first  to  its  domain  at  the 
content level, and then to the aspect of
representation that makes this information visible.

The  development  in  this  area  has  just  begun,  but 
there  is  reason  to  think  the  language  of  data  will 
ultimately be trilingual.

IMPLICATIONS OF THE LANGUAGE OF DATA FOR COMPUTING SYSTEMS
William  H.  Rogers  [1] 

The  Rand  Corporation  and  the  Language  of  Data  Project 


The Language of Data (LOD) is a program of
basic research into the communication of
quantitative information. The concepts developed under
this umbrella range from the formal definition
of a single datum through formalisms for
microdatabases, tables, graphs, and the relationships
between them. The theoretical aspects are
covered in companion papers by Dolby, Clark, and
Rogers [2], and Clark [3].

This paper discusses the kind of computer
system envisioned under the Language of Data,
including applications which have been implemented and
elements not yet developed. As such, the system
represents a design consideration for future
software developers rather than a finished product.
Moreover, the set of programs discussed here
should not be confused with the theory itself.
The understanding of the structure of the datum
and of tables developed by this project could be
applied to other programs as well.

The  envisioned  programs  focus  primarily  on 
three  themes:  First,  they  focus  on  a  principal 
application  of  LOD,  the  documentation  of  large 
databases.  Second,  they  illustrate  some  of  the 
formal  ideas  (computation  with  descriptors  and 
applicability  conditions)  in  ways  familiar  to 
statisticians.  Third,  they  incorporate  some  of 
the insight into table structure to provide
several natural and powerful tools for the data
analyst. 

The  Overarching  Computing  Plan 

The  "grand  plan"  for  a  computing  system  is 
represented  by  the  diagram  in  Figure  1.  This 
diagram  is  especially  relevant  to  a  large  survey 
study  such  as  those  Rand  or  many  federal  agencies 
would perform. Individuals are surveyed in
several different ways and the data are collected
into a microdatabase. The organizational
structure of the survey itself results in a houselike
structure  for  the  microdatabase,  as  shown  in 
Figure  2. 


The  individuals  (cases)  form  the  stub  of  the 
microdatabase  and  the  variables  are  named  in  the 
column  heads.  Each  variable  comes  with  a  set  of 
descriptors  that  make  up  the  second  story  of  the 
house,  and  the  attic  consists  of  descriptor  sets 
at  the  topic  level  that  tie  the  columns  of  the 
microdatabase  together.  The  field — the  two-way 
matrix  of  numbers — is  the  portion  of  the  data 
structure  statisticians  and  statistical  packages 
focus  on.  The  overall  structure  might  be  viewed 
as a shorthand device for summarizing the
descriptive content of each individual datum. However,
this  is  a  case  in  which  the  whole  is  greater  than 
the  sum  of  its  parts. 

LOD is not concerned with relational
structures and other features of certain existing
databases.  Statisticians  typically  view  these 
as  expanded  matrices  and  aggregate  them  to  a  more 
convenient  unit  of  analysis,  and  then  merge  the 
results  with  other  data  having  comparable  units 
of  analysis. 

The  first  step  is  to  find  our  way  around  by 
means  of  the  descriptor  sets.  They  themselves 
form  a  matrix  in  which  the  columns  correspond  to 
the  descriptors  and  the  cases  are  the  variables  of 
the  microdatabase.  This  is  the  transpose  of  the 
way they appear in the house structure. The
computerized tool for doing this is a descriptor
manipulator called IDEA [4].
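
As a rough illustration of this transposition, the sketch below turns a house-style layout (one list of entries per descriptor, aligned with the variables) into a matrix with one row per variable. The variable names and entries are invented for the example, and `descriptor_matrix` is a hypothetical helper, not a part of IDEA.

```python
# Hypothetical second story of the "house": one list per descriptor,
# aligned with the variables named in the column heads.
variables = ["name", "sbp", "occupation"]
second_story = {
    "Source": ["Rand", "Rand", "NORC"],
    "Observer": ["Survey Ctr.", "Rand Nurse", "NORC"],
    "Function": ["Identity", "Systolic Blood Pressure", "Occupation"],
}


def descriptor_matrix(variables, second_story):
    """Transpose the house layout: one row per variable, one column per descriptor."""
    columns = list(second_story)
    rows = [
        [variable, *entries]
        for variable, entries in zip(variables, zip(*second_story.values()))
    ]
    return columns, rows


columns, rows = descriptor_matrix(variables, second_story)

# Once transposed, the descriptors can be sorted like any other data matrix.
by_source = sorted(rows, key=lambda row: row[1 + columns.index("Source")])
for row in by_source:
    print(row)
```

In this form the variables play the role of cases, so the usual select, sort, and arrange operations apply to the verbal descriptors directly.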

IDEA is an experimental program which operates
on the transposed matrix of descriptors, using
both familiar and novel tools of data analysis.
It is especially good for moving around in the
data to gain a view of the whole, and for visually
selecting and arranging data (in terms of their
verbal descriptors) using techniques for holding
some of the data fixed and sorting others. There
are also operators that uncover hierarchic
structure through a highlighting technique sometimes
called "slipping." More details about the program
are given below.
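
The highlighting rule is spelled out later in the paper: an entry is left plain when it and every entry to its left repeat the row above, and an entry that changes is shown in bold together with everything to its right. A minimal sketch of that rule follows. The function name `slip_mask`, the sample rows, and the choice to treat the first row as entirely bold are assumptions made for the illustration, not details of IDEA.

```python
def slip_mask(rows):
    """Return, for each entry, True if the slipping rule would show it in bold."""
    mask = []
    previous = None
    for row in rows:
        # Assumption: the first row has nothing above it, so it is all bold.
        changed = previous is None
        flags = []
        for position, entry in enumerate(row):
            if not changed and entry != previous[position]:
                changed = True  # this entry and all to its right become bold
            flags.append(changed)
        mask.append(flags)
        previous = row
    return mask


# Hypothetical sorted descriptor rows (Observer, Matter, Function).
rows = [
    ["NORC", "Adult mem", "Occup"],
    ["NORC", "Adult mem", "Relat"],
    ["NORC", "Spouse", "Occup"],
    ["NORC", "Spouse", "Occup"],
]
mask = slip_mask(rows)

# Upper case stands in for high-intensity display.
for row, flags in zip(rows, mask):
    print("  ".join(e.upper() if bold else e.lower() for e, bold in zip(row, flags)))
```

On sorted data the bold entries mark where each level of the hierarchy starts a new group.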

This kind of analysis was especially popular
when statisticians kept data on wall charts. It
became unfashionable when computers made it more
difficult to find one's way around a large data
set than it was to do a regression analysis. In
my experience as a consulting statistician,
forcing the client to visually confront his or her
data is more effective in revealing truth than
the most powerful of statistical tests.

FIGURE 1

FIGURE 3

The upper levels of the microdatabase
structure also provide indexing possibilities, for
locating individual variables or subsets of
variables at the lower levels. This is important in
a  large  study  which  may  involve  many  collaborators 
and  10,000+  variables.  One  can  envision  library- 
style  searches  of  the  descriptor  sets,  together 
with  browsing  capabilities  among  similar  sets. 
These  possibilities  are  currently  being  explored 
at  Rand. 

Moving  down  to  the  next  level  of  Figure  1,  we 
have  the  key  undeveloped  piece,  the  "Executive 
Intermanipulator"  and  its  slave,  the  ubiquitous 
statistical  package.  The  job  of  the  Executive 
Intermanipulator is to convert data in
microdatabase form to tabular form by aggregation,
tabulation,  summarization,  or  the  more  complex 
operations  of  statistical  analysis.  This  activity 
is  usually  performed  by  the  primary  statistical 
analyst. 

To aggregate successfully, the Executive
Intermanipulator must know the rules of aggregation.
It should check the glossary to determine whether
the quantity being aggregated is additive and the
thesaurus to know whether the units being
aggregated over form an exhaustive partition.
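
The two checks can be sketched as follows. Since the Executive Intermanipulator is described as undeveloped, everything here is hypothetical: the glossary and thesaurus entries, the partition name, and the function `aggregation_problems` are placeholders chosen only to show the shape of the test.

```python
# Hypothetical glossary and thesaurus entries (not taken from the paper).
GLOSSARY = {
    "population": {"additive": True},
    "median income": {"additive": False},
}
THESAURUS = {
    "census regions": {"Northeast", "Midwest", "South", "West"},
}


def aggregation_problems(quantity, units, partition):
    """Return the reasons an aggregation should be questioned (empty if none)."""
    problems = []
    # Glossary check: is the quantity additive?
    if not GLOSSARY.get(quantity, {}).get("additive", False):
        problems.append(f"{quantity!r} is not known to be additive")
    # Thesaurus check: do the units form an exhaustive partition?
    expected = THESAURUS.get(partition)
    if expected is None:
        problems.append(f"no partition named {partition!r} in the thesaurus")
        return problems
    units = list(units)
    if len(units) != len(set(units)):
        problems.append("some units appear more than once")
    missing = expected - set(units)
    if missing:
        problems.append(
            "partition is not exhaustive; missing " + ", ".join(sorted(missing))
        )
    extra = set(units) - expected
    if extra:
        problems.append("units outside the partition: " + ", ".join(sorted(extra)))
    return problems


print(aggregation_problems("population", ["Northeast", "South"], "census regions"))
```

Returning a list of reasons rather than a yes/no answer lets the system tell the analyst why a requested total is suspect.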

The  output  of  the  Executive  Intermanipulator 
might  be  a  table  or  a  graph.  Adequate  titles, 
column  heads,  source  information,  units,  and 
labels  would  be  generated  from  the  microdatabase. 
These  might  be  awkward  in  wording;  they  would  be 
generated  algorithmically  and  automatically. 

The Executive Intermanipulator is also
responsible for operations on microdatabase variables
that create other microdatabase variables. For
example, it would draw on the glossary to
determine that population divided by land area gives
population density and it would query the user if
he or she attempts to subtract land area from mean
January temperature, since that constitutes an
invalid comparison (incommensurate units of
measurement or descriptors that differ in more than
one dimension).
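
A small sketch of this kind of check, under stated assumptions: variables carry only a name and a unit, the glossary is a lookup table of known derived quantities, and an invalid request raises an exception where the paper's system would query the user. The class, the table, the units, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    name: str
    unit: str


# Hypothetical glossary of derived quantities, keyed by (left, operator, right).
DERIVED = {
    ("population", "/", "land area"): Variable(
        "population density", "persons per square mile"
    ),
}


def combine(left, op, right):
    """Return the variable produced by `left op right`, or raise if it is invalid."""
    if op in ("+", "-"):
        # Sums and differences only make sense for commensurate units.
        if left.unit != right.unit:
            raise ValueError(
                f"cannot compute {left.name} {op} {right.name}: "
                f"{left.unit} and {right.unit} are incommensurate"
            )
        return Variable(f"{left.name} {op} {right.name}", left.unit)
    derived = DERIVED.get((left.name, op, right.name))
    if derived is None:
        raise LookupError(f"glossary has no entry for {left.name} {op} {right.name}")
    return derived


population = Variable("population", "persons")
land_area = Variable("land area", "square miles")
january_temp = Variable("mean January temperature", "degrees F")

print(combine(population, "/", land_area).name)
```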

If the result of the operation is a table, then
the table should be in standard computer table
interchange format. This computerized format
breaks down the table into its structural
components. (The structural components of a table are
described in Clark [3].)

Tables may also be drawn from online sources.
Gateway, an experimental program developed by the
San Jose State University Mathematics Clinic [5]
under the sponsorship of the Language of Data
Project, can query the Lockheed Dialog
information retrieval system for certain types of online
tables. These tables are also generated in the
standard interchange format. Additional
information on the Gateway program is contained in the
Mathematics Clinic report [6].

The  key  tool  for  secondary  analysts  is  the 
Table  Manipulator.  This  program,  developed  by 
Rogers,  is  written  in  a  general-purpose  language 
and  has  been  implemented  in  experimental  form  on 
an IBM-PC. The Table Manipulator displays and
operates on tables in the standard format. In
addition to select and arrange operators
characteristic of the descriptor manipulator, the Table
Manipulator can combine tables and perform
transformations and statistical analyses. Exploratory
data  analysis  methods  are  available.  The  table 
structure  and  the  contents  of  the  table,  including 
verbal  information,  are  used  (in  conjunction  with 
the  glossary)  to  guide  the  computation  and  to 
check  applicability  conditions.  For  example,  if 
the  user  requests  aggregations,  it  checks  that  all 
the  appropriate  partitions  are  represented. 

An  important  property  of  the  Table  Manipulator 
is  that  it  operates  on  tables  to  produce  objects 
which  are  themselves  tables. 

The Table Manipulator shares with spreadsheets
the fact that changes in the data can be
immediately reflected in the results. However, it
differs from spreadsheets in several important
respects. The Table Manipulator is aware of table
structure and its implications, whereas
spreadsheets require the user to specify what this
structure is. The Table Manipulator can compute
with the labels; the spreadsheet cannot. Taking
advantage of the first two properties, the Table
Manipulator can invoke statistical methodology
without requiring specification at the individual
datum level of what to do. The Table Manipulator
has available a library of related information
(via the Gateway program or other statistical
analysis) to draw upon in the analysis. Finally, it
permits revision and editing, with full access to
a step-by-step reanalysis of the data.

If the Table Manipulator is a tool of analysis,
what constitutes an adequate audit trail for such
work? Most interactive statistical packages create
a log of commands that have been executed. The
Table Manipulator also has this capability. The
analyst can then go back to a previous step by
rerunning the sequence of commands up to the given
point. The analysis can also be repeated with new
data, making it possible to do "yesterday's
analysis on today's data."
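
The replay idea can be sketched in a few lines. The two operations, the command names, and the sample tables below are placeholders invented for the illustration; the sketch makes no claim about how the Table Manipulator stores its log.

```python
# Hypothetical operations; each takes a table (a list of rows) and returns a new one.
OPERATIONS = {
    "sort": lambda table: sorted(table, key=lambda row: row[0]),
    "totals": lambda table: [row + [sum(row)] for row in table],
}


def replay(table, log, upto=None):
    """Re-run the logged commands on `table`, optionally stopping after `upto` steps."""
    steps = log if upto is None else log[:upto]
    for command in steps:
        table = OPERATIONS[command](table)
    return table


log = ["sort", "totals"]
yesterday = [[14, 15, 14], [7, 4, 7]]
today = [[8, 2, 10], [0, 2, 0]]

print(replay(yesterday, log, upto=1))  # back up to the state after the first command
print(replay(today, log))              # "yesterday's analysis on today's data"
```

Because the log holds commands rather than results, the same sequence applies unchanged to any table of the same structure.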

The  Table  Manipulator  (and  the  descriptor 
manipulator  IDEA)  go  beyond  this  by  giving  the 
user  full  access  to  displays  in  previous  steps, 
recorded  as  a  state  of  the  data  and  display  rather 
than  a  fixed  screen  of  characters.  The  analyst 
can  go  back  to  a  previous  state  and  continue  from 
there  simply  by  pushing  a  few  keys.  Moreover,  if 
editing changes have been made, the refreshed
displays can reflect the new information.

Elaborating on this technology leads us to a
tool that enhances or clarifies the structure of
the comparisons in tables. A comparison
structure is a systematic way of making comparative
inferences from the contents of the table. For
example, each column or row might be compared with
its neighboring column or row. The place to start
is by elucidating comparison structures employed
by naive and expert readers. More complex
technologies move toward exploratory data analysis
ideas pioneered by Tukey [7].

The Table Manipulator is a considerably
different tool from a table-producing language such
as TPL [8]. TPL is designed to produce
attractive tabular displays from data sets but is not
designed to operate on the data in the tables,
except through certain formatting commands. That
is, TPL does not simulate the structural elements
of tables, but simply produces them line by line.

The Table Manipulator is more like a
spreadsheet in its feel, but differs from a spreadsheet
in several respects. First, it takes advantage of
table  structure,  including  the  inherent  row  and 
column  structure  and  verbal  information,  such  as 
"Total".  Second,  it  has  two-way  table  operators, 
such  as  analysis  by  means  and  exploratory  data 
analysis  operators,  built  in.  Third,  it  has  the 
ability  to  track  what  has  changed  and  to  go  back 
and  forth  between  displays.  Fourth,  there  is  a 
library  of  information  available  through  sources 
such  as  Gateway  that  may  be  combined  with  the 
given  table  for  use  in  making  comparisons. 

Some  related  work  has  also  been  done  on  the 
kind  of  mathematically  rigorous  language  and  table 
structure needed for the overarching system
outlined here. Graves and Blaine [9], working in
collaboration  with  the  Language  of  Data  Project, 
have  described  a  computer  language  called  ALGOS 
which  facilitates  the  description  of  statistical 
methods  in  terms  of  algorithms  or  applicability 
conditions.  ALGOS  has  extensible  data  types,  so 
that  the  descriptive  component  of  a  datum  can  be 
carried and processed with its numerical
component. Graves and Manor [10] have discussed the
structure  of  a  table  in  this  framework. 

Descriptors  and  the  Descriptor  Manipulator 

A basis of the Language of Data theory is that
the datum contains both a descriptive and a
numerical part, where the descriptive part consists
of a common, one-level classification of
information. The terminology in the following examples
is defined in Dolby, Clark, and Rogers [2].
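
As an informal sketch of this idea, a datum can be held as a value together with a flat dictionary of descriptors, and data that agree on every descriptor except one can be stacked into microdatabase form. The descriptor entries below are those of the paper's blood-pressure example (Figure 3); the `Datum` class and the `to_microdatabase` helper are assumptions made for the illustration. The sketch simply drops the stub facet from the shared descriptors, whereas the paper's figure generalizes it (to "Patient").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datum:
    descriptors: dict  # one-level classification: facet name -> entry
    value: float       # the numerical part


def to_microdatabase(data, stub_facet="Matter"):
    """Split data into shared variable descriptors and (stub, value) rows."""
    shared = {k: v for k, v in data[0].descriptors.items() if k != stub_facet}
    rows = []
    for datum in data:
        rest = {k: v for k, v in datum.descriptors.items() if k != stub_facet}
        if rest != shared:
            raise ValueError("data differ in more than the stub facet")
        rows.append((datum.descriptors[stub_facet], datum.value))
    return shared, rows


common = {
    "Source": "Rand",
    "Observer": "Rand Nurse",
    "Function": "Systolic Blood Pressure",
    "Space": "Rand Examination Center",
    "Time": "8 am, January 6, 1986",
    "Aspect": "Pressure, mm Hg (F) Systolic BP",
}
data = [
    Datum({**common, "Matter": "Mary Jones"}, 120),
    Datum({**common, "Matter": "John Smith"}, 135),
]
shared, rows = to_microdatabase(data)
print(shared["Function"], rows)
```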

Thus we might create two data, as shown in
Figure 3a, which would be combined with the stub in a
statistical matrix as shown in Figure 3b. The
descriptors for the stub, generated at an earlier
stage, contain information that would be crucial
in a real data set.

FIGURE 3  Structural Details of Microdatabase

(a) Two data

Source:    Rand
Observer:  Rand Nurse
Matter:    Mary Jones
Function:  Systolic Blood Pressure
Space:     Rand Examination Center
Time:      8 am, January 6, 1986
Aspect:    Pressure, mm Hg (F) Systolic BP
Domain:    120

Source:    Rand
Observer:  Rand Nurse
Matter:    John Smith
Function:  Systolic Blood Pressure
Space:     Rand Examination Center
Time:      8 am, January 6, 1986
Aspect:    Pressure, mm Hg (F) Systolic BP
Domain:    135

(b) Data in microdatabase form

           Stub           Variable

Source:    Rand           Rand
Obsvr:     Survey Ctr.    Rand Nurse
Matter:    Patient        Patient
Functn:    Identity       Systolic Blood Pressure
Space:     Universal      Rand Examination Center
Time:      Universal      8 am, January 6, 1986
Aspect:    Name (F) Id.   Pressure, mm Hg (F) Systolic BP
Domain:    Alphabetic     120...155
Values:    Mary Jones     120
           John Smith     135

IDEA, the descriptor manipulator, processes a
matrix of descriptive values consisting of the
entries for Source,..., Domain. (Recall that this
is the top section of Figure 1.) Like any matrix
of numerical values, the descriptive values can
be selected and arranged in useful formats. The
following examples are from an LOD classification
of the General Social Survey [11]:

GSS83  Version  0/ 0

           1          2          3         7
      ID   Observer   Matter     Funct...  Domain

  1    1   Adult mem  Adult mem  Type ...  Professio
  2    2   NORC       Adult mem  Occup...  0-9, 10-1
  3    3   Adult mem  Adult mem  Emplo...  Self-empl
  4    4   Adult mem  Adult mem  Type ...  Agricultu
  5    5   NORC       Adult mem  DOT o...  Relationa
  6    6   NORC       Adult mem  Relat...  Synthesiz
  7    7   NORC       Adult mem  Relat...  Mentoring
  8    8   NORC       Adult mem  Relat...  Setting-u
 20   20   NORC       Spouse     Occup...  Lowest le

On 0
Row 1 Col 7

The display fills the IBM-PC display (25x80)
and has function key commands which instantly
scroll through the database. The function keys
expand or contract fields.

One can arrange rows and columns in a specific
order, as shown in the next display:

GSS83  Version  0/ 0

           1          2          3          4
      ID   Space      Observer   Matter     Funct.

  1    1   Continent  Adult mem  Adult mem  Type ...
  2    2   Continent  NORC       Adult mem  Occup...
  3    3   Continent  Adult mem  Adult mem  Emplo...
  4    4   Continent  Adult mem  Adult mem  Type ...
  5    5   Continent  NORC       Adult mem  DOT o...
  6    6   Continent  NORC       Adult mem  Relat...
  7    7   Continent  NORC       Adult mem  Relat...
  8    8   Continent  NORC       Adult mem  Relat...
  .    .   .          .          .          .
 20   20   Continent  NORC       Spouse     Occup...

ARR 4 1 2 3
On 0
Row 1 Col 7

It is also possible to sort on any particular row
or column.

The descriptor manipulator has two interesting
operators used in conjunction with its sorting
capabilities. The FIX operator highlights a set
of rows and columns and keeps them in a fixed
position on the screen, sorting or arranging
around them wherever requested to rearrange data.
This makes it possible to compare a set of fixed
information with another set physically located
in a different part of the display without having
to create an artificial sort key or otherwise
disturb the key information.

The SLIP operator uses bold (high intensity) to
display hierarchic structure. If an item and all
the items left of it are the same as in the
previous row, then the item is not in bold. When a
column entry changes from the one above it, that
entry and all those to the right are shown in bold.

The descriptor manipulator also has the ability
to transpose the data matrix, revealing expanded
detail in the descriptor sets.

Operation of the Table Manipulator

The basic operating format of the Table
Manipulator is a display approximating the desired form
of the table:

1.  U.S. CONSUMPTION OF ENERGY,
    by END-USE SECTOR and by YEAR: 1977-1980
    (BTU x 10**9)

                                   YEAR
END-USE SECTOR          1977     1978     1979     1980

Residential & comm    27.569   28.159   27.462   27.283
Industrial            29.024   29.373   31.551   30.284
Transportation        19.735   20.612   19.950   18.628

All end-use s[2]      76.332   78.150   78.968   76.201

SOURCE: Monthly Energy Review, June 1981, p. 18
NOTE 1: 07/05/84, JLD
NOTE 2: Totals may not equal sum of components
        due to independent rounding

The number in the upper left of each display is the
display number generated by the program.

A command to total the columns would produce
display 2:

2.  U.S. CONSUMPTION OF ENERGY,
    by END-USE SECTOR and by YEAR: 1977-1980
    (BTU x 10**9)

                                   YEAR
END-USE SECTOR          1977     1978     1979     1980

Residential & comm    27.569   28.159   27.462   27.283
Industrial            29.024   29.373   31.551   30.284
Transportation        19.735   20.612   19.950   18.628

All end-use s[2]      76.332   78.150   78.968   76.201
                      ------   ------   ------   ------ *
Total                 76.328   78.144   78.963   76.195

SOURCE: Monthly Energy Review, June 1981, p. 18
NOTE 1: 07/05/84, JLD
NOTE 2: Totals may not equal sum of components
        due to independent rounding

Command? total
Command?

The rows or columns may be sorted with a
command:

Command? sort 1980

Suppose we now want to look at a table of sport
parachuting deaths which is stored as 'dpara' on
the disk in our table interchange format:

Command? read dpara

We might then calculate both the row and column
totals:

5.  DEATHS FROM SPORT PARACHUTING,
    by JUMP EXPERIENCE and by YEAR

                      YEAR
Number
of jumps       1973   1974   1975   Total

1-24             14     15     14      43
25-74             7      4      7      18
75-199            8      2     10      20
200+             15      9     10      34
Unreported        0      2      0       2

Total            44     32     41     117

SOURCE: Metropolitan Life Insurance Company,
        Stat. Bull.: 3 p 4 (1979)
NOTE 1: 03/13/86, NC

Command? read dpara
Command? totals
Command?
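
The margins in the parachuting display, and the small discrepancy in the energy display between the published "All end-use" figure and the computed total, are easy to check by hand or with a few lines of code. The sketch below is only an arithmetic check on the published figures; the function name `margins` is a hypothetical helper, not a Table Manipulator command.

```python
labels = ["1-24", "25-74", "75-199", "200+", "Unreported"]
deaths = [  # columns: 1973, 1974, 1975
    [14, 15, 14],
    [7, 4, 7],
    [8, 2, 10],
    [15, 9, 10],
    [0, 2, 0],
]


def margins(rows):
    """Return row totals, column totals, and the grand total of a two-way table."""
    row_totals = [sum(row) for row in rows]
    column_totals = [sum(column) for column in zip(*rows)]
    return row_totals, column_totals, sum(row_totals)


row_totals, column_totals, grand_total = margins(deaths)
for label, total in zip(labels, row_totals):
    print(f"{label:<12}{total:>5}")
print(f"{'Total':<12}{grand_total:>5}")

# Energy table, 1977 column: the three sectors versus the published total.
energy_1977 = [27.569, 29.024, 19.735]
published_1977 = 76.332
discrepancy = round(published_1977 - sum(energy_1977), 3)
print("rounding discrepancy:", discrepancy)
```

The nonzero discrepancy is the situation the energy table's second note warns about: a published total need not equal the sum of independently rounded components.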

The Table Manipulator also produces
transformations of the data. When a transformation is done,
a question arises of what to do with totals.
Ordinarily a total should be eliminated, but a mean
should be recalculated; so the program is designed
to do this.
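
One way to picture this rule is to keep the summary lines of a table separate from its body and tag each with its kind. The representation below (a body of rows plus a dictionary mapping summary labels to "total" or "mean") is an assumption made for the sketch, as are the sample numbers.

```python
import math


def transform(body, summaries, fn):
    """Apply fn to the body; drop totals and recompute column means."""
    new_body = [[fn(value) for value in row] for row in body]
    new_summaries = {}
    for label, kind in summaries.items():
        if kind == "mean":
            # A mean is still meaningful after transformation, so recalculate it.
            new_summaries[label] = [
                sum(column) / len(column) for column in zip(*new_body)
            ]
        # A total of the original values no longer describes the
        # transformed body, so it is simply left out.
    return new_body, new_summaries


body = [[4.0, 9.0], [16.0, 25.0]]
summaries = {"Total": "total", "Mean": "mean"}
new_body, new_summaries = transform(body, summaries, math.sqrt)
print(new_body, new_summaries)
```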

The command to do a square-root transformation
would produce display 6:

6.  (TRANSFORMED) DEATHS FROM SPORT PARACHUTING,
    by JUMP EXPERIENCE and by YEAR

                      YEAR
Number
of jumps       1973   1974   1975

1-24            3.7    3.9    3.7
25-74           2.6    2.0    2.6
75-199          2.8    1.4    3.2
200+            3.9    3.0    3.2
Unreported      0.0    1.4    0.0

SOURCE: Metropolitan Life Insurance Company,
        Stat. Bull.: 3 p 4 (1979)
NOTE 1: 03/13/86, NC
NOTE 2: values transformed by sqrt(*)

Command? totals
Command? sqrt(*)
Command?

We can also calculate percentages or other
functions of the data, using either the entire
value field or specific rows and columns. The
program asks whether an analysis should be done
by means or by medians. If the response were
means, the result would be the table shown in
display 7:

7.  (TRANSFORMED) DEATHS FROM SPORT PARACHUTING,
    by JUMP EXPERIENCE and by YEAR

                      YEAR
Number
of jumps       1973   1974   1975   Effect

1-24           -0.2    0.2   -0.1      1.3
25-74           0.1   -0.3    0.2     -0.1
75-199          0.2   -1.0    0.7     -0.0
200+            0.4   -0.2   -0.2      0.9
Unreported     -0.6    1.0   -0.5     -2.0

Effect          0.1   -0.1    0.0      2.5

SOURCE: Metropolitan Life Insurance Company,
        Stat. Bull.: 3 p 4 (1979)
NOTE 1: 03/13/86, NC
NOTE 2: values transformed by sqrt(*)
NOTE 3: two-way analysis by means

Command? sqrt(*)
Command? analyze
Means or medians? means
Command?

After looking at the means, we might now wish to
back up to the original counted data and request
an analysis by medians, shown in display 8.

8.  DEATHS FROM SPORT PARACHUTING,
    by JUMP EXPERIENCE and by YEAR

                      YEAR
Number
of jumps       1973   1974   1975   Effect

1-24              0      2      0        6
25-74             0     -2      0       -1
75-199            0     -5      2        0
200+              5      0      0        2
Unreported        0      3      0       -8

Effect            0     -1      0        8

SOURCE: Metropolitan Life Insurance Company,
        Stat. Bull.: 3 p 4 (1979)
NOTE 1: 03/13/86, NC
NOTE 2: two-way analysis by medians

Command? back
Command? analyze medians
Command?

We might then spot a flaw in these data: we
should not be comparing across number of jumps
without some denominator. If we had information
on the number of jumps by various categories in
1980, we could join these (hypothetical) data in
the table as shown in display 9.

9.  DEATHS FROM SPORT PARACHUTING,
    by JUMP EXPERIENCE and by YEAR

                      YEAR
Number                                       (Hypothet)
of jumps       1973   1974   1975   Effect   1980 jumps

1-24              0      2      0        6          336
25-74             0     -2      0       -1          525
75-199            0     -5      2        0          642
200+              5      0      0        2         2023
Unreported        0      3      0       -8

Effect            0     -1      0        8

SOURCE: Metropolitan Life Insurance Company,
        Stat. Bull.: 3 p 4 (1979); Imaginary
        Sport Parachuting Club (1982)
NOTE 1: 03/13/86, NC
NOTE 2: two-way analysis by medians
NOTE 3: 03/19/86, WHR

Command? join njumps
Match categories (linear interpolation)? exponent
Command?

Note the need in joining data to understand
differences in the stub values. In this case, the
data joined were not classified in precisely the
same way as the data in our existing tables, so
some interpolation rule had to be employed. The
rule employed in display 9 was one based on the
exponential distribution.
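
The figures in the analysis by medians can be checked against the original counts. The sketch below performs a single sweep, removing row medians and then column medians, which happens to reproduce the residuals and effects shown for the medians analysis. Whether the Table Manipulator stops after one sweep or iterates further is not stated in the paper, so the procedure and the function name `two_way_by_medians` should be read as assumptions.

```python
from statistics import median

deaths = [  # rows: 1-24, 25-74, 75-199, 200+, Unreported; columns: 1973-1975
    [14, 15, 14],
    [7, 4, 7],
    [8, 2, 10],
    [15, 9, 10],
    [0, 2, 0],
]


def two_way_by_medians(table):
    """One row-then-column median sweep of a two-way table."""
    row_medians = [median(row) for row in table]
    residuals = [[v - m for v in row] for row, m in zip(table, row_medians)]
    column_effects = [median(column) for column in zip(*residuals)]
    residuals = [
        [v - e for v, e in zip(row, column_effects)] for row in residuals
    ]
    common = median(row_medians)
    row_effects = [m - common for m in row_medians]
    return residuals, row_effects, column_effects, common


residuals, row_effects, column_effects, common = two_way_by_medians(deaths)
for row, effect in zip(residuals, row_effects):
    print(row, effect)
print(column_effects, common)
```

Each original count is recovered as the common value plus its row effect, its column effect, and its residual; for example 14 = 8 + 6 + 0 + 0 for the 1-24 group in 1973.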

The  Table  Manipulator  has  a  cursor  that  can  be 
moved  from  field  to  field.  For  example,  we  could 
move  the  cursor  to  row  "Unreported"  and  column 
"1974",  and  then  revise  the  original  value  2  with 
the  command  sequence 

Command?  recall  4 
Command?  revise  1 

The  first  command  takes  us  back  to  display  4,  and 
the  second  command  changes  the  value  to  1.  This 
new  value  is  carried  through  to  all  subsequent 
analyses.  Thus  we  can  now  move  forward  in  the 
displays  to  track  the  effect  of  that  revision 
through  the  analysis.  The  effect  on  display  14, 
for  example,  would  be  as  shown.  Notice  the  auto¬ 
matically  generated  footnote  in  this  display. 
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14.  DEATHS  FROM  SPORT  PARACHUTING, 
by  JUMP  EXPERIENCE  and  by  YEAR 


YEAR 


Number 
of  jumps 

1973 

1974 

1975 

Effect 

(Hypothet) 
1980  jumps 

1-24 

0 

2 

0 

6 

336 

25-74 

0 

-2 

0 

-1 

525 

75-199 

0 

-5 

2 

0 

642 

200+ 

5 

0 

0 

2 

2023 

Unreported[2]  0 

•2* 

0 

-8 

Effect 

0 

-1 

0 

8 

SOURCE:  Metropolitan  Life  Insurance  Company, 
Stat.  Bull.:3  p  4  (1979);  Imaginary 
Sport  Parachuting  Club  (1982) 

NOTE  1:  03/13/86, NC 

NOTE  2:  original  1974  value  was  2;  03/ 20/86, WHR 
NOTE  3:  two-way  analysis  by  medians 
NOTE  4:  03/19/86.WHR 


To  summarize,  the  Table  Manipulator  is  an  ex¬ 
perimental  tool  for  secondary  data  analysis  using 
existing  tabular  material.  It  embodies  design 
objectives  dictated  by  the  Language  of  Data  and 
demonstrates  some  areas  in  which  the  design  of 
existing  interactive  packages  might  be  improved. 
It  is  one  of  many  possible  tools  in  a  Language  of 
Data  computer  system  and  should  be  viewed  in  this 
context. 


NOTES: 

[1]  Address:  The  Rand  Corporation,  Santa  Monica,  CA 
90406;  Language  of  Data  Project,  Box  R,  Sausalito, 
CA  94966.  This  work  supported  by  a  grant  from  the 
System  Development  Foundation  through  San  Jose 
State  University. 

[2]  Dolby  JL,  Clark  N,  Rogers  WH:  A  General  Theory 
of  Data,  Presented  at  the  1986  Interface  Meetings, 
Fort  Collins,  CO,  March  1986. 


Command?  forward 
Command?  forward 
Command? 


[3]  Clark  N:  Tables  and  Graphs  as  Language, 
Presented  at  the  1986  Interface  Meetings,  Fort 
Collins,  CO,  March  1986. 


Finally,  an  illustration  of  the  automatic 
comparison  capabilities.  The  Table  Manipulator 
determines  the  amount  of  fuzz  in  the  table  and 
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16.  DEATHS  FROM  SPORT  PARACHUTING, 
by  JUMP  EXPERIENCE  and  by  YEAR 

YEAR 


Number 
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1975 
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14 

15 

14 

« 

« 

« 
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7 

4 

7 

75-199 

8 

>  2 

«  10 

« 

« 

« 
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15 

9 

10 

» 

» 

Unreported 

0 

2 

0 

SOURCE:  Metropolitan  Life  Insurance  Company, 

Stat.  Bull.: 3  p  4  (1979) 

NOTE  1:  03/ 13/86, NC 

NOTE  2:  comparisons  cut  at  3.1  and  6.2 
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This  paper,  adapted  from  a  recent  talk  by  the 
late  James  L.  Dolby,  was  prepared  for  presen¬ 
tation  by  Nancy  Clark  and  William  H.  Rogers. 

It  represents  theoretical  developments  by 
Dolby  and  Clark,  with  a  focus  on  Dolby's  work. 
Because  the  paper  was  written  in  two  halves, 
the  first  by  Clark  and  the  second  by  Rogers, 
it  reflects  a  shift  in  perspective.  We  felt 
it  appropriate  to  retain  the  two  perspectives 
to  give  the  reader  some  flavor  of  the  cross- 
disciplinary  nature  of  the  Language  of  Data. 

The  Language  of  Data  stems  from  some  practical 
problems  that  affect  more  than  one  discipline. 

Most  of  these  problems  are  old.  Analysts  have  to 
work  with  data  that  are  so  poorly  identified  that 
there  is  often  no  clue  to  their  ancestry,  let 
alone to the phenomenon they represent. One
problem is that this information is not documented, or
at  least  not  documented  in  a  form  in  which  it  can 
be  passed  along. 

Another  problem  is  the  lack  of  any  mechanism 
for  transmitting  information,  in  unambiguous  form, 
through  an  entire  chain  of  communication.  Although 
the  communication  of  information  is  usually  treated 
as  a  presentation  issue,  for  data  the  problem  is 
more fundamental. At every point in the
communication chain information is transferred from one
stage to the next through a visual intermediary,
a table or a graph. Thus unambiguous
communication depends on a precise definition of both the
information to be communicated and the medium of
communication. The properties that convey
information in tables and graphs are discussed in Clark
(1986).  This  paper  focuses  on  the  components  of 
information,  the  content  of  the  communication. 

The  general  sequence  of  events  for  data  is 
shown  by  the  diagram  in  Figure  1.  Each  of  these 
stages  involves  a  different  use  of  the  data,  and 
the  activities  at  any  stage  may  alter  the  content, 
generate  new  information,  or  lose  information. 

The  biggest  loss  of  information  is  at  the  data- 
collection  stage,  where  there  is  a  wealth  of 
descriptive  information,  but  no  criteria  that 
cover  the  essential  elements  of  description. 


Language  of  Data  Project,  Box  R,  Sausalito,  CA 
94966.  Research  supported  by  a  grant  from  the 
System  Development  Foundation  to  San  Jose  State 
University. 


There  is,  for  example,  no  formal  provision  for 
including  a  description  of  the  phenomenon  that 
was  measured  as  an  integral  part  of  the  data.  In 
fact,  the  reported  values  are  commonly  referred 
to  as  "the  data,"  a  definition  that  disconnects  the 
measurements  from  what  was  measured  at  the  first 
pass.  The  accompanying  documentation  may  describe 
the  measurement  process  in  copious  detail,  but  the 
essential  ingredients  have  not  made  it  into  the 
record. 

The  Formal  Structure  of  Data 

The  foundation  of  the  Language  of  Data  is  a 
formalism  for  incorporating  the  essential  elements 
of  description,  based  on  a  faceted  classification 
scheme  originally  developed  in  information  science. 
Most  classification  structures  are  hierarchic.  In 
biological  classifications,  for  example,  all  the 
species  are  organized  into  a  single  hierarchy.  In 
a  faceted  structure  each  object  or  event  is  iden¬ 
tified  in  terms  of  a  set  of  descriptors,  each  rep¬ 
resenting  a  different  facet  of  that  phenomenon. 

The  Dolby  model  is  based  on  a  facet  structure 
originally  designed  as  a  universal  scheme  for 
library  classification,  in  which  the  contents  of 
all  documents  were  described  in  terms  of  five 
fundamental  dimensions  of  description — 

Observer,  Matter,  Function,  Space,  and  Time 

with  the  author  name  added  for  unique  identifi¬ 
cation.  In  less  formal  terms,  these  categories 
are  the  standard  Who,  What,  How,  Where,  and  When 
of  reporting.  They  are,  in  fact,  the  minimum 
requirements  of  description  for  any  reported 
event.  The  application  of  this  model  to  scien¬ 
tific  data  is  discussed  in  Dolby  (1983).  For 
data,  of  course,  a  further  level  of  specifica¬ 
tion  is  needed — the  aspect  observed  and  the 
values  of  the  observation. 

The  result  is  a  formal  definition  of  a  complete 
datum,  shown  in  Figure  2.  Dolby  defined  a  datum 
as  an  ordered  pair  consisting  of  an  observation 
component  and  a  descriptive  component — a  finite 
set  of  terms  that  identify  the  phenomenon  the  data 
represent  (Dolby  and  Clark,  1982).  Notice  that 
unique  identification  works  in  both  directions  in 
this model. The set of descriptors uniquely
identifies the phenomenon being described. The
part bracketed as "descriptive identification"
refers to unique identification of the values.
As a storage system, faceted classification
provides a unique address for every number in
the database.


FIGURE 1 The chain of communication for data


DATUM STRUCTURE

Descriptive identification (bracket)
Source: 1 to n
Descriptive component: Observer, Matter, Function, Space, Time
Observation: Aspect, Value

FIGURE 2 The formal structure of data
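The datum structure described above can be sketched as a small data type. This is only an illustration, not part of the Dolby model itself: the class and field names are hypothetical, and the sketch assumes that the "unique address" of a value is the descriptor set taken together with the aspect.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Descriptors:
    """Descriptive component: source chain plus the five facets."""
    source: Tuple[str, ...]  # source chain, 1 to n entries
    observer: str            # Who
    matter: str              # What
    function: str            # How
    space: str               # Where
    time: str                # When


@dataclass(frozen=True)
class Datum:
    """Ordered pair: (descriptive component, observation component)."""
    descriptors: Descriptors
    aspect: str
    value: object

    def address(self):
        # Assumption: descriptors plus aspect identify the value uniquely.
        return (self.descriptors, self.aspect)


d = Datum(
    Descriptors(
        source=("Rand",),
        observer="Rand Nurse",
        matter="Mary Jones",
        function="Systolic Blood Pressure",
        space="Rand Examination Center",
        time="8 am, January 6, 1986",
    ),
    aspect="Pressure, mm Hg",
    value=120,
)

# A database keyed by address: every number has exactly one address.
database = {d.address(): d.value}
```

Because both classes are frozen, an address can serve directly as a dictionary key, which is one simple way to read "a unique address for every number."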

Some  of  the  structural  details  are  easier  to 
see  if  the  descriptors  are  represented  as  an  un¬ 
organized  set,  as  shown  in  Figure  3a.  The  symbols 
O, M, F, S, and T refer to the descriptor
categories (Observer, Matter, Function, Space,
and Time), and X denotes the aspect and its
domain of values.
The  classification  scheme  sorts  the  descriptive 
elements  of  data  into  five  facets.  Each  descrip¬ 
tor  belongs  to  some  variable  in  one  of  these 
facets,  which  specifies  its  meaning  in  a  particu¬ 
lar  context — the  difference,  for  example,  between 
a  camel's-hair  brush  and  a  camel's  hairbrush.  The 
result  is  a  set  of  descriptors,  each  defined  in 
terms  of  its  own  underlying  variable,  but  struc¬ 
turally  independent  of  each  other.  This  is  an 
exceedingly  useful  arrangement  for  later  manipu¬ 
lation  of  the  terms  during  analysis,  an  applica¬ 
tion  discussed  in  Rogers  (1986). 

From  the  standpoint  of  description,  however, 
we  still  have  only  a  set  of  descriptors — a  collec¬ 
tion  of  words,  each  with  a  cleanly  defined  meaning 
but  with  no  relationship  to  each  other.  Faceted 
classification  specifies  the  semantic  content  of 
the  description,  but  not  the  context.  To  connect 
the  data  to  the  phenomenon  they  represent,  we  need 
a  basis  of  organization  for  a  coherent  description 
of  the  phenomenon.  And  to  connect  the  observation 
to  the  phenomenon  under  study,  we  also  need  the 
formal  link  between  aspect  and  the  phenomenon. 

Neither  of  these  structures  lies  in  the  classi¬ 
fication  scheme.  However,  the  missing  elements 
are  provided  by  Clark's  development  of  the  role 
of  aspect  in  the  information  structure  of  tables 


(Clark,  1986).  The  first  link  relates  to  the  role 
of  the  observer  in  specifying  the  aspect  observed. 

In  any  description  of  reality  we  have  to  know 
the  perspective  from  which  the  objects  and  events 
are  being  viewed  in  order  to  tell  what  we  are 
looking  at.  In  a  direct  observation  the  original 
observer  determines  the  viewing  point.  Under 
these  circumstances  the  focus  of  the  observation 
might  be  any  aspect  of  the  phenomenon,  depending 
on  the  observer's  interest.  However,  data  are 
collected  about  a  phenomenon  of  interest  to  the 
data  collector.  In  this  case  the  original  obser¬ 
ver  is  responding  to  questions  framed  by  the  data 
collector,  in  connection  with  the  event  about 
which  the  data  collector  is  gathering  informa¬ 
tion.  Thus  the  only  vantage  point  represented 
even  in  raw  data  is  that  of  the  data  collector, 
the  person  known  as  "source."  The  identity  of 
the  original  observer,  the  witness  who  reported 
the  values,  is  part  of  the  formal  record  of  the 
data  collector's  observation. 

This  is  true  even  where  the  data  collector  and 
the  observer  are  the  same  person;  the  researcher 
frames  the  questions  in  one  role  and  records  the 
observations  in  another.  In  subsequent  publica¬ 
tions  of  the  data  the  identity  of  both  observers 
may  be  available  through  the  source  chain.  It  is 
the  author  of  the  current  source  document,  how¬ 
ever,  who  specifies  the  aspect  of  the  data  being 
discussed. 

Notice  that  as  soon  as  we  get  into  the  struc¬ 
ture  of  description  the  role  of  aspect  begins  to 
emerge.  From  a  grammatical  standpoint,  aspect  is 
a  "relative  term" — one  that  requires  an  antece¬ 
dent.  The  distinction  between  "aspect  measured" 
and  "aspect  discussed,"  for  example,  is  in  effect 
a  transfer  of  attention  from  the  object  that  was 
measured  to  the  object  of  study,  the  topic  term  in 


FIGURE 3 Descriptive structure of a datum

FIGURE 4 The role of aspect in classification


the  descriptor  set.  However,  aspect  also  implies 
a  part/whole  relationship — the  aspect  of  the  event 
that  is  the  focus  of  the  observation,  and  in  par¬ 
ticular,  the  one  about  to  be  discussed  in  further 
detail.  Thus,  from  the  standpoint  of  faceted 
classification,  aspect  is  merely  further  speci¬ 
fication  in  the  dimension  of  interest. 

The  data  structure  in  Figure  2  is  shown  at  the 
top  of  Figure  4,  represented  as  an  ordered  pair. 
When  we  start  with  the  phenomenon  instead  of  the 
values,  the  observation  component  shows  up  below 
the  descriptive  component  as  a  direct  expansion 
from  the  topic  term  in  the  descriptor  set.  In 
this  case  the  topic  of  the  discussion  is  educa¬ 
tion,  and  the  specific  topic  is  the  aspect  of 
education  described  by  the  data,  SAT  scores.  The 
values  of  interest,  at  the  next  level,  are  the 
recorded  values,  compared  to  other  possible  sets 
of  values.  From  the  bottom  up,  of  course,  aspect 
is  also  the  name  of  the  variable.  Thus  the  val¬ 
ues  of  the  variable  are  part  of  a  fully  connected 
structure,  describing  the  particular  aspect  of 
education  that  was  observed. 

Aspect  also  serves  as  an  important  structural 
link  between  tables  and  graphs.  However,  of  more 
immediate  interest,  its  role  as  part  of  the  topic 
structure  gives  us  a  basis  for  using  descriptor 
sets  to  summarize  the  information  structure  of  the 
data  as  well  as  the  content. 


The  Classification  Process 

The  purpose  of  an  initial  classification  is  to 
provide  the  analyst  with  an  accurate  representa¬ 
tion  of  the  author's  meaning,  as  a  starting  point 
for  analysis.  This  means  a  consistent  picture  of 
the  author's  descriptive  structure,  not  the  clas¬ 
sifier's,  and  not  the  analyst's;  that  comes  later. 

Figure  5  shows  the  information  requirements  for 
faceted  classification.  Although  the  source  chain 
is  technically  a  part  of  the  observer  facet,  the 
two  source  entries  are  sorted  out  here  to  keep  the 
event  structure  straight.  The  source  entries  refer 
to  the  events  of  publication  and  data  collection, 
and  the  kernel  set — the  next  five  descriptors,  plus 
aspect  and  its  domain — refer  to  the  event  described 
by  the  data.  The  last  entry  is  also  a  trace  back 
to  a  prior  event. 

In  most  cases  the  secondary  analyst  is  working 
with  variables  that  have  passed  through  unknown 
hands  and  may  be  in  various  degrees  of  removal 
from  the  original  source.  However,  the  only  de¬ 
scriptive  structure  we  can  classify  is  the  one  we 
can  see,  the  current  representation  by  the  author 
of  the  current  source  document.  The  first  source 
entry  therefore  identifies  the  author  whose  de¬ 
scription  is  represented  by  the  classification. 

The  same  author  descriptor  in  both  source  cate¬ 
gories  identifies  the  data  as  primary  data.  Note, 


FIGURE 5 Information requirements for classification
From Clark, Classification Procedures (1984)


(a) Descriptor set for a main topic
ATTITUDES ON SOCIAL CONTROL

Current source: NORC; July 1984
Original source: NORC; spring 1983

Observer: Adult member of household

Matter: Adult member of household

Function: Social control

Space: U.S./world

Time: Current

Aspect: Attitudes [→ Social control (F)]
Domain: Civil liberties, sexual behavior,
women's rights, criminal justice, violence,
religion, suicide/euthanasia, economic
controls (national spending)

•Aspect: Same as current aspect


however,  that  the  time  of  publication  is  not  the 
same  thing  as  the  time  of  data  collection.  The 
aspect  entry  at  the  bottom,  once  it  is  captured 
in  the  primary  data,  specifies  the  original  con¬ 
text  in  which  the  data  were  collected. 

Although  the  formalism  is  defined  at  the  datum 
level,  the  classification  scheme  extends  to  the 
level  of  a  whole  database  or  any  level  in  between. 
At  the  highest  level  the  descriptor  categories 
contain  what  are,  in  essence,  the  title  elements 
for  the  whole  database,  with  the  purpose  of  data 
collection  as  the  function  descriptor  and  the  set 
of  main  topics  as  the  domain  of  values.  The  next 
breakout  is  an  expansion  on  each  of  the  main  top¬ 
ics,  and  the  domain  at  that  level  lists  the  sub- 
topics  at  the  level  below. 

For  example,  one  of  the  main  topics  in  the  Gen¬ 
eral  Social  Survey  is  attitudes  on  social  control, 
shown in Figure 6a. The GSS is an annual survey
consisting  of  about  300  questions,  designed  as  a 
program  of  social-indicator  research.  The  data¬ 
base  consists  entirely  of  primary  data,  and  the 
data-collection  agency,  the  National  Opinion 
Research  Center,  is  the  author  of  the  codebook. 

The  respondents  were  adult  members  of  households, 
the  observer  descriptor.  They  were  being  asked 
about  themselves,  so  at  this  level  of  generality 
they  are  also  the  matter  component,  and  the  func¬ 
tion  descriptor  is  social  control.  The  variables 
cover  domestic  and  world  events,  as  shown  by  the 
space  descriptor,  and  the  time  was  current,  the 
respondents'  attitudes  at  the  time  of  interview. 

The  aspect  entry  consists  of  two  terms — the 
aspect  term  and  a  pointer  back  to  the  topic  term 
in  the  main  descriptor  set.  These  two  descrip¬ 
tors,  plus  the  arrow,  specify  the  focus  of  the 
discussion,  the  underlying  question  the  data  were 
designed  to  answer.  The  topic  term  is  a  function, 
social  control;  therefore  the  subsets  of  this  top¬ 
ic  are  also  functions — civil  liberties,  sexual 
behavior,  and  so  on. 

Figure 6b shows a descriptor set for one of the
variables  in  this  subset.  At  the  variable  level 
the  domain  describes  the  set  of  values.  At  this 
level,  if  each  variable  is  visualized  as  the  set 
of  answers  to  a  question,  it  should  be  possible  to 


(b)  Descriptor  set  for  a  variable 
CODE  NAME:  GRASS 

Current  source:  NORC;  July  1984 
Original  source:  NORC;  spring  1983 

Observer:  Adult  member  of  household 

Matter:  Marijuana 

Function:  Legalization 

Space:  Continental  U.S. 

Time:  Current 

Aspect: Opinion [→ Legalization (F)]
Domain: Should, should not, don't know, no
answer, not applicable

•Aspect: Same as current aspect


FIGURE 6 Descriptor sets for a main topic and a
variable in the General Social Survey


tell  from  the  descriptor  set  for  the  variable  what 
the  question  was  and  who  answered  it. 
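As a rough illustration, the descriptor set in Figure 6b can be written down as a plain record. The key names and helper functions below are hypothetical; the sketch assumes that the pointer in the aspect entry can be stored as the name of the facet that holds the topic term, and that primary data can be recognized by the same author appearing in both source entries.

```python
# Descriptor set for the GSS variable GRASS, transcribed from Figure 6b.
grass = {
    "code_name": "GRASS",
    "current_source": ("NORC", "July 1984"),
    "original_source": ("NORC", "spring 1983"),
    "observer": "Adult member of household",
    "matter": "Marijuana",
    "function": "Legalization",
    "space": "Continental U.S.",
    "time": "Current",
    # Aspect term plus a pointer to the facet holding the topic term.
    "aspect": ("Opinion", "function"),
    "domain": ["Should", "should not", "don't know",
               "no answer", "not applicable"],
}


def is_primary(ds):
    """Same author in both source entries marks primary data."""
    return ds["current_source"][0] == ds["original_source"][0]


def topic_term(ds):
    """Follow the aspect pointer back to the topic term."""
    _, facet = ds["aspect"]
    return ds[facet]


def in_domain(ds, value):
    """True if a reported value is one of the listed domain values."""
    return value in ds["domain"]
```

Here `is_primary(grass)` is true even though the two source dates differ, which matches the point that the time of publication is not the time of data collection.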

The  descriptor  sets  in  Figure  6  contain  a  dense 
pack  of  information,  including  some  that  is  not 
obvious.  For  example,  not  all  the  information  is 
in  the  words.  The  visual  grouping  of  the  categor¬ 
ies  is  part  of  the  information  structure,  and  in  a 
highly  condensed  description  this  visual  informa¬ 
tion  is  essential  for  rapid  comprehension.  Once 
the  analyst  is  familiar  with  the  data,  the  infor¬ 
mation  at  this  level  is  no  longer  needed — except 
by  the  next  user.  At  a  later  stage  the  descrip¬ 
tors  are  converted  to  a  different  display  form 
for  manipulation.  This  application  is  discussed 
in  Rogers  (1986). 

Descriptor sets can also be used to summarize
the contents of a table. The table in Figure 7
contains several function terms: deaths, sport,
parachuting, jumps. However, the event being
described is deaths from sport parachuting. The
matter descriptor is the entities involved in
this event, the parachutists. The space and time
descriptors identify the place and time of
occurrence. In this case the space might be
either U.S. or world; however, unless this
information is stated elsewhere in the source
document, the appropriate descriptor is
"unspecified." The time descriptor is the time
period covered by the data. The source note gives
both the current source and the original source,
but who reported the values? In secondary data
the original observer is often a missing element.
The values are counts, so the aspect is simply
number: number of deaths.


FIGURE 7 Descriptive elements in a table

DEATHS FROM SPORT PARACHUTING

Number of jumps    1973    1974    1975
1-24                 14      15      14
25-74                 7       4       7
75-199                8       2      10
200 or more           0       2       0

Source: Velleman and Hoaglin, ABCs of EDA,
p. 224. Data from Metropolitan Life Insurance
Company, Statist. Bull. 60, no. 3 (1979);
reprinted by permission.

In  this  data  set  the  anonymous  observer  does 
not  affect  most  uses  of  the  data.  In  medical  data 
the  identity  of  the  observer  would  be  critical;  it 
is  important  to  know  whether  the  responses  were 
supplied  by  the  doctor  or  the  patient.  The  ambi¬ 
guity  in  the  space  facet,  however,  might  easily 
lead  to  misinterpretation  of  the  data.  In  working 
with  existing  data,  descriptor  sets  are  a  useful 
tool  for  pinpointing  what  is  and  is  not  known 
about  the  data. 

Descriptor  sets  are  also  a  useful  tool  for 
determining  whether  two  data  sets — or  two  vari¬ 
ables — can  in  fact  be  compared.  Two  data  are 
"simply  comparable"  if  they  differ  in  only  one 
descriptor  and  the  differing  descriptors  are  in 
the  same  facet  (Dolby  and  Clark,  1982).  In  the 
first  row  of  Figure  7,  for  example,  the  only 
source  of  variation  is  time;  jump  experience  is 
common  to  both  data,  and  all  the  other  descriptors 
are  constant  over  the  set.  By  the  same  token,  in 
the  1973  column,  all  the  data  have  identical  de¬ 
scriptors  except  for  jump  experience,  in  the 
function  facet.  If  one  were  interested  in  a 
comparison  of  deaths  from  sport  parachuting  in 
1973  with  U.S.  motorcycle  deaths  in  that  year, 
pairing  off  the  descriptors  in  each  category  would 
identify  a  possible  mismatch  in  the  space  facet, 
and hence a confounded comparison.


DEATHS FROM SPORT PARACHUTING
(by jump experience and year)

                                  Year              Total,
Number of jumps[?]        1973    1974    1975    1973-1975
1-24                        14      15      14           43
25-74                        7       4       7           18
75-199                       8       2      10           20
200 or more                  0       2       0           34
All experience levels       44      32      41

Source: Velleman and Hoaglin, ABCs of EDA, p. 224.
Data from Metropolitan Life Insurance Company,
Statist. Bull. 60, no. 3 (1979); reprinted by
permission.

FIGURE 8 Aggregation of descriptors
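The test for simple comparability can be sketched in a few lines. This is a minimal illustration under stated assumptions: each facet is modeled as a tuple of descriptors (so that jump experience can sit in the function facet next to the event itself), and the function names and the `cell` helper are hypothetical.

```python
FACETS = ("observer", "matter", "function", "space", "time")


def differences(a, b):
    """List (facet, descriptor_a, descriptor_b) wherever a and b differ."""
    diffs = []
    for facet in FACETS:
        da, db = a[facet], b[facet]
        if len(da) != len(db):
            diffs.append((facet, da, db))
            continue
        for x, y in zip(da, db):
            if x != y:
                diffs.append((facet, x, y))
    return diffs


def simply_comparable(a, b):
    """Two data are simply comparable if exactly one descriptor differs."""
    return len(differences(a, b)) == 1


def cell(jumps, year):
    """Descriptor set for one cell of the parachuting table."""
    return {
        "observer": ("unspecified",),
        "matter": ("parachutists",),
        "function": ("deaths from sport parachuting", jumps),
        "space": ("unspecified",),
        "time": (year,),
    }


a = cell("1-24 jumps", "1973")
b = cell("1-24 jumps", "1974")   # same row, different year
c = cell("25-74 jumps", "1973")  # same column, different experience
d = cell("25-74 jumps", "1974")  # differs in two facets
```

With these definitions `a` and `b` differ only in time and `a` and `c` differ only in function, so both pairs are simply comparable; `a` and `d` are not. The list returned by `differences` also shows where a proposed comparison is confounded.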


Data Relationships

The operations on data are discussed below, but
in brief, for every algebra that operates on the
numbers, there is a corresponding algebra that
operates on the descriptors. The total column in
Figure 8, for example, is an aggregation over the
time period as well as the number of deaths. The
total in the stub raises some questions about table
construction. The stubhead is number of jumps,
but the subject of the table is "jump experience,"
and the aggregation is over categories of jump
experience, not number of jumps.

Whereas the classification scheme provides the
semantic content of the Language of Data, the
relationships in a data set constitute its syntactic
structure. Both components together produce
meaning, the information derived from data. The
result in tables is a set of "data sentences" that
follow the grammatical rules of ordinary prose
(Dolby and Clark, 1982). At the generic level,
however, the components of information can be
stated as

Data elements + operational relationships = meaning

The set of possible statements in a table or graph
is limited by the allowable operations on data.
In the Language of Data, therefore, the meanings
are limited to statistical meanings.

The medium of communication for data is a visual
representation. The communication of meaning
therefore depends on the visible counterparts of
these content elements. In tables, as in verbal
language, the syntactic relationships of the terms
are implicit in the terms being related. Thus in
a table the verbal elements carry the burden of
communication:

Verbal elements + implicit relationships = implicit meaning

In graphs this burden falls instead on the
relational component:

Visual elements + visible relationships = visible meaning

This part of the theoretical framework, a general
theory of data representation, is discussed in
more detail in Clark (1986).

First,  however,  there  are  some  other  pieces  to 
cover.  The  picture  so  far  has  been  a  bird's-eye 
view  of  the  communication  diagram  in  Figure  1. 

The  remainder  of  the  discussion  is  a  description 
of  the  Language  of  Data  from  a  more  familiar  per¬ 
spective,  that  of  the  data  user. 

The  Communication  Problem  from  the  Analyst's 
Perspective 

Each  person  in  the  chain  of  communication  for 
data  needs  to  understand  what  happened  earlier  in 
the  chain.  Analysis,  a  form  of  self-communica- 
tion,  is  a  key  part  of  this  sequence.  However, 
each  member  of  the  chain  has  his  or  her  own  per¬ 
spective.  If  Dolby  writes  a  paper  about  energy 
data  which  you  read,  you  need  to  know  Dolby's 
frame  of  reference.  You  want  to  know  that  he 
started  with  data  published  by  the  Energy  Infor¬ 
mation  Administration,  the  original  source,  and 
you  want  to  know  that  the  observers  who  supplied 
the values were the energy companies. If you dig
deeper, you may also want to know how Dolby got
his numbers. How did he do the computations?
What other decisions did he make about the data
that affected the outcome of his analysis?

We  will  discuss  the  chain  of  communication  from 
three  perspectives:  the  careful  reader  (who  could 
be  anyone),  the  secondary  analyst  (in  this  case, 
an  analyst  who  starts  with  the  derived  data  and 
has  weak  access  to  the  original  source),  and  the 
primary  analyst  (who  has  primary  data  and  strong 
access  to  the  original  source). 

The  careful  reader  will  want  to  know  where  the 
numbers  were  obtained.  This  always  includes 
"source"  information,  but  a  careful  reader  will 
also  want  to  know  how  the  numbers  were  calculated 
and  what  the  units  are.  He  or  she  may  also  re¬ 
quire  a  detailed  understanding  of  the  row  heads 
(the  stub)  and  the  column  heads.  Are  all  the 
possible  row  and  column  categories  represented? 
What  kinds  of  summary  statistics  were  used?  What 
are  the  omissions? 

If  the  reader  is  also  a  data  analyst,  he  will 
want  an  understanding  of  the  link  between  the 
microdata  and  the  aggregated  data  in  the  table  and 
will  be  curious  about  what  was  not  said,  as  well 
as  what  was  said.  What  did  the  author  of  the  data 
find  along  the  way  that  influenced  his  choices? 
Even  better,  what  hidden  assumptions  was  the 
author  operating  under?  The  analyst  may  want  to 
revise  the  methods,  or  to  repeat  yesterday's  anal¬ 
ysis  with  today's  data.  He  will  want  to  combine 
his  own  data  with  the  table  to  shed  more  light  on 
trends  observed  in  the  table,  or  he  may  want  to 
manipulate  the  table  further  to  reveal  a  more 
complex  structure  in  the  data. 

If  a  secondary  analyst  has  access  to  a  micro¬ 
database,  he  will  repeatedly  ask  questions  like 
"What  is  in  this  data  set?"  or  "What  does  this 
particular  variable  mean?"  We  have  all  spent 
many  long  hours  poring  over  someone  else's  un¬ 
decipherable  codebook. 

Both the reader and secondary analyst live at
the far end of the diagram in Figure 1 and depend
on the primary analyst to supply the right
information in a usable form.

The primary analyst has parallel but different
concerns. He is often in charge of a very large
microdatabase (in some Rand work there may be
10,000 raw variables and 1,000 derived variables).
He must be able to track down and eliminate
errors caused by programming mistakes or
misunderstood instructions.

Part of the problem relates to storage and
retrieval. Primary analysts usually work in a
constantly changing setting, with data arriving
continually and moving through various stages of
processing. They must be able to find derived
variables created by their colleagues in order
to avoid reinventing the wheel (or worse, a
slightly different wheel). Communication is at
various times communication to other audiences,
self-communication, or communication with
colleagues.

How does the Language of Data theory fit into
these problems? We began with a definition of the
datum and then we examined this structure in terms
of microdatabases and tables. We could think of
a microdatabase as a form of table. What are the
differences between them? The data in tables are
aggregated. The values are usually the downstream
product of analysis, the results of statistical
manipulations of microdatabases. Cross-tabulation,
the operation that produces the contingency table,
is a good example of analysis that starts with
microdata and produces a table as the end result.
Tables may also be manipulated into new tables, a
subject discussed further in Rogers (1986).

The Statistical Microdatabase

At this point, however, let us take a new look
at a familiar object, the statistical microdatabase.
The top-down classification outlined above for the
General Social Survey results in the conceptual
structure of a microdatabase shown in Figure 9.

FIGURE 9 Structure of a Microdatabase

The variables are organized into columns, and each
number is comparable to each of the other numbers
in the same column. Ideally they are simply
comparable; that is, the descriptors differ in only
one dimension of description. The one descriptive
dimension that differs between cases corresponds
to the stub.

The following examples are two data that Rand
might collect as part of its medical studies:

Source: Rand
Observer: Rand Nurse
Matter: Mary Jones
Function: Systolic Blood Pressure
Space: Rand Examination Center
Time: 8 am, January 6, 1986
Aspect: Pressure, mm Hg (F) Systolic BP
Domain: 120

Source: Rand
Observer: Rand Nurse
Matter: John Smith
Function: Systolic Blood Pressure
Space: Rand Examination Center
Time: 8 am, January 6, 1986
Aspect: Pressure, mm Hg (F) Systolic BP
Domain: 135

These are two simply comparable data about the
systolic blood pressure of Mary Jones and John
Smith (the matter facet). All the other descriptors
are the same, and the two data share a common
aspect and unit of measurement. The value of the
measurement is placed in the domain.

To  represent  this,  we  create  a  header  for  each 
variable  which  consists  of  the  descriptors  common 
to  that  column.  Any  descriptor  that  corresponds 
to  the  stub  is  elliptical.  The  stub  has  a  descrip¬ 
tor  set  which  describes  the  sample.  The  result  is 
a  picture  like  that  in  Figure  10. 
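The construction of a column header can be sketched as follows. This is an illustration only, with hypothetical function and key names: descriptors shared by every datum in the column go into the header, and whatever varies is taken to be the stub. The sketch does not substitute a category name such as "patient" for the varying descriptor.

```python
FACETS = ("source", "observer", "matter", "function",
          "space", "time", "aspect")


def header_and_stub(data):
    """Split a column of data into common header and varying stub."""
    first = data[0]
    # Descriptors that are identical across the whole column.
    header = {f: first[f] for f in FACETS
              if all(d[f] == first[f] for d in data)}
    # Facets that vary between cases correspond to the stub.
    stub_facets = [f for f in FACETS if f not in header]
    column = {tuple(d[f] for f in stub_facets): d["value"] for d in data}
    return header, stub_facets, column


def bp(patient, reading):
    """One blood-pressure datum, following the two Rand examples."""
    return {
        "source": "Rand",
        "observer": "Rand Nurse",
        "matter": patient,
        "function": "Systolic Blood Pressure",
        "space": "Rand Examination Center",
        "time": "8 am, January 6, 1986",
        "aspect": "Pressure, mm Hg",
        "value": reading,
    }


header, stub_facets, column = header_and_stub(
    [bp("Mary Jones", 120), bp("John Smith", 135)]
)
```

For the two Rand data the only varying facet is matter, so the stub lists the patients and every other descriptor moves into the header.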


Source:   Rand           Rand                          Rand
Obsvr:    Survey Ctr.    Rand Nurse                    Patient
Matter:   Patient        Patient                       Patient
Functn:   Identity       Systolic Blood Pressure       Satisfaction w MD
Space:    Universal      Rand Examination Center       U.S.
Time:     Universal      8 am, January 6, 1986         Answered Quest.
Aspect:   Name (F) Id.   Pressure, mm Hg (F) Syst BP   Level of Satisf.
Domain:   Alphabetic     120,...,155                   1=Very Satisfied,..., 5=Very Unsatisfied
Values:   Mary Jones     120                           1
          John Smith     135

FIGURE 10 Structural details of a microdatabase


The individual patients (Mary Jones, John
Smith) are listed in the stub and identified by
category name as "patient" in the matter slot. A
second variable is also shown. Notice that for
this variable "patient" is also the observer. To
analyze this variable with the first one we must
assume that it does not matter that a different
observer observed each patient, and that the
observations were made at a different time on each
patient. These data are not simply comparable.
Otherwise, if we think time or observer affect the
quality of the observation, we have to include
time or observer information in our model, which
complicates the analysis.

So far, we have a set of one-way tables with a
common stub. A statistical database usually has
an added piece of structure represented as the
attic in Figure 10. This is the topic structure,
which cements the database together. The topical
organization captures the relationships between
variables intended at the time of data collection.
In a secondary analysis, we might use the data for
a new purpose, leading to an entirely different
organization of the variables, and hence a new
topic structure. Curiously, we have found in
experiments with the General Social Survey that
statisticians tend to invent unique topic
structures, unique from each other and different
from those produced by the nonstatistical
classifiers, who tend to agree. Perhaps this
reflects analytic structuring even without numbers.

The topic structure is a hierarchy, and
hierarchic structure also shows up in the
descriptors themselves. Gender is divided into
female and male. Races may be partitioned White,
Black, Oriental, American Indian, and Other.
Decades are divided into years, then months, then
days. A good deal of statistics is devoted to the
discovery and identification of classification
categories. Problems arise in comparing data from
two sources that use different partitions of the
same concept: for example, fiscal years and
calendar years. We need the equivalent of
relational structures to handle this situation.

We also need concepts of balances and forbidden
values. For example, the inventory at the
beginning of the month, plus additions and
subtractions, should give the inventory balance at
the end of the month. The whole is the sum of its
parts, so proportions should sum to 1. For most
data a variable such as gender takes only two
values, and if there are more than two something
is wrong with the data. In one data set, for
example, the reported values for gender were M, F,
P, and Q. In the same data set gender also
differed depending on the observer. Because
existing systems do not embody concepts of domain
and observer, this was a major problem to uncover
and repair.
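Checks of this kind are easy to state once the domain is part of the record. The following sketch is illustrative only; the function names, the inventory figures, and the tolerance are assumptions, not part of any existing system.

```python
import math


def forbidden_values(values, domain):
    """Return reported values that fall outside the declared domain."""
    return sorted(set(values) - set(domain))


def balances(opening, additions, subtractions, closing, tol=1e-9):
    """Opening inventory plus additions minus subtractions = closing."""
    expected = opening + sum(additions) - sum(subtractions)
    return math.isclose(expected, closing, abs_tol=tol)


def proportions_sum_to_one(proportions, tol=1e-9):
    """The whole is the sum of its parts."""
    return math.isclose(sum(proportions), 1.0, abs_tol=tol)


# Gender values as reported in the data set mentioned above.
bad = forbidden_values(["M", "F", "P", "Q", "F"], {"M", "F"})
```

Applied to the gender example, `bad` holds the two values outside the domain, P and Q. A check on observer agreement could be added in the same way if the observer is recorded with each value.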

How  do  we  formalize  the  relational  structures? 
One  way  is  through  a  thesaurus  and  glossary.  The 
thesaurus  holds  the  possible  classification  struc¬ 
tures  and  the  glossary  contains  the  specialized 
definitions.  It  is  very  important  in  each  appli¬ 
cation  that  the  user  be  able  to  add  his  or  her 
own  thesaurus  and  glossary  information. 

Finally,  we  need  a  record  of  interactions  with 
the  data.  The  need  for  a  time-stamped  record  of 
the  editing  and  revision  process  is  something  that 
survey  data  processing  centers  and  fiscal  admini¬ 
strators  have  known  all  along,  but  the  process 
seems  to  be  foreign  to  analysts.  Nevertheless, 
this  is  the  only  way  we  can  gain  access  to  assump¬ 
tions  that  the  analyst  may  not  have  been  aware  of. 


Up  to  this  point  we  have  been  talking  about  a 
theory  of  good  documentation,  ideally  as  part  of 
the  database,  not  external  to  it.  The  classifica¬ 
tion  scheme  provides  the  initial  documentation. 
However,  the  initial  classification  is  essentially 
static:  it  is  a  description  of  the  given  data  in 
the  context  in  which  they  were  collected.  This 
documentation  has  to  reach  the  analyst,  but  each 
analyst  is  working  in  a  different  context,  and 
the  analyst's  description  of  the  data  also  has  to 
be  documented.  Incorporating  this  documentation 
in  the  database  as  well  increases  the  likelihood 
that  it  will  be  passed  along  and  supplemented  by 
others. 

We  also  want  to  go  deeper  than  documentation. 

A  Language  of  Data  computing  system  should  be  able 
to  manipulate  the  descriptive  component  of  the 
data  as  well  as  the  numbers.  To  get  to  dynamics 
we  need  to  understand  what  statistical  manipula¬ 
tions  are  like.  We  have  already  discussed  simply 
related  data — data  that  have  differing  descriptors 
in  only  one  facet.  We  can  go  on  to  the  applica¬ 
bility  conditions  for  aggregation: 

1  The  data  must  constitute  a  simply  related  data 
set  (differ  in  only  one  descriptive  dimension) 

2  The  set  of  items  to  be  aggregated  must  form  an 
exhaustive  partition  of  the  concept  to  which 
they  are  being  aggregated. 


3 The aspect must be a measure (in the formal
sense).
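The three conditions can be read as a checklist that a system might run before totaling. The following Python sketch is one possible reading, not part of the Language of Data itself: the `Datum` class, the facet names, the `PARTITIONS` thesaurus entry, and the `MEASURES` set are all hypothetical, and the numbers are illustrative only.

```python
from dataclasses import dataclass

# Hypothetical thesaurus entry: the parts that exhaustively partition a concept.
PARTITIONS = {("region", "whole country"): {"north", "south", "east", "west"}}

# Aspects assumed to be measures in the formal (additive) sense.
MEASURES = {"population"}


@dataclass(frozen=True)
class Datum:
    aspect: str          # what was measured, e.g. "population"
    descriptors: tuple   # (facet, value) pairs describing the datum
    value: float


def varying_facets(data):
    """Return the descriptive facets on which the data do not all agree."""
    seen = {}
    for d in data:
        for facet, val in d.descriptors:
            seen.setdefault(facet, set()).add(val)
    return {facet for facet, vals in seen.items() if len(vals) > 1}


def can_aggregate(data, facet, whole):
    """Check the three applicability conditions for totaling over `facet`."""
    # 1. Simply related: the data differ in exactly one descriptive facet.
    simply_related = varying_facets(data) == {facet}
    # 2. The items form an exhaustive partition of the target concept.
    parts = [dict(d.descriptors)[facet] for d in data]
    exhaustive = (len(set(parts)) == len(parts)
                  and PARTITIONS.get((facet, whole)) == set(parts))
    # 3. The aspect is a measure.
    is_measure = all(d.aspect in MEASURES for d in data)
    return simply_related and exhaustive and is_measure


def make(region, year, value):
    """Build an illustrative datum with two descriptive facets."""
    return Datum("population", (("region", region), ("year", year)), value)


# Illustrative values only.
data = [make(r, 1980, v) for r, v in
        [("north", 10.0), ("south", 20.0), ("east", 5.0), ("west", 15.0)]]
print(can_aggregate(data, "region", "whole country"))      # all conditions hold
print(can_aggregate(data[:3], "region", "whole country"))  # "west" is missing
```

Keeping the partition in a separate table mirrors the role the text assigns to the thesaurus: the classification structure is stored data that a user could extend, not logic buried in the aggregation routine.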

A  valid  proportion  is  the  ratio  of  a  datum  to  a 
valid  aggregate  of  which  it  is  a  part,  and  the  do¬ 
main  is  further  restricted  to  be  nonnegative  with 
a  meaningful  zero  (this  collection  of  restrictions 
is  sometimes  called  "ratio  measure").  If  we  knew 
that  a  certain  proportion  had  passed  the  applica¬ 
bility  conditions,  we  could  be  sure  it  obeyed  the 
value  restrictions. 

As a more complex example, monotonicity requires
two simply related data sets. The descriptor
that varies in each of the data sets must be
the same descriptor, and the two data sets must
match up on this descriptor. In other words,
they must form a table with a common stub. For
the numbers, we have the usual condition. The
second data set (Y) is monotone increasing with
respect to the first (X) if X2 > X1 implies Y2 >
Y1 for all possible pairs (X1, X2).
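As a minimal sketch of this test, assume each data set is held as a Python dictionary keyed by the values of the shared descriptor (the common stub). The function name and the dictionary representation are assumptions made for illustration.

```python
from itertools import combinations


def is_monotone_increasing(x, y):
    """x and y map each value of the shared descriptor to a number."""
    # Descriptive condition: the two data sets must match up on the stub.
    if x.keys() != y.keys():
        raise ValueError("data sets do not match up on the shared descriptor")
    pairs = [(x[k], y[k]) for k in x]
    # Numeric condition: X2 > X1 implies Y2 > Y1 for every pair of rows.
    return all(
        (y2 > y1) if x2 > x1 else (y1 > y2) if x1 > x2 else True
        for (x1, y1), (x2, y2) in combinations(pairs, 2)
    )


x = {"a": 1, "b": 2, "c": 3}
print(is_monotone_increasing(x, {"a": 10, "b": 20, "c": 30}))
print(is_monotone_increasing(x, {"a": 10, "b": 5, "c": 30}))
```

The key comparison is made before any numbers are examined, which reflects the order in the text: the descriptive condition comes first, the numeric condition second.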

What  about  more  complex  computations?  First, 
we  must  recognize  that  statistical  manipulations 
take  place  at  several  levels.  At  the  microdata 
level  we  may  manipulate  one  or  more  variables  to 
produce  derived  data.  The  computation  may  be  as 
simple  as  changing  "No,  I  did  not  go  to  the  doc¬ 
tor,  if  yes,  how  many  visits?"  to  "zero  visits." 
It  may  be  a  standard  balance  equation:  net  income 
equals  gross  income  minus  taxes.  Population  di¬ 
vided  by  area  gives  population  density,  but  mean 
January  temperature  divided  by  land  area  gives 
nothing.  We  might  be  able  to  see  this  from  the 
units,  or  better,  from  the  glossary. 

Derivation  at  the  microdata  level  also  may 
involve  comparing  data  from  different  sources:  A 
patient  is  adult-onset  diabetic  if  the  doctor  says 
he  has  diabetes  and  the  patient  says  the  symptoms 
appeared  after  he  was  thirty.  Or  the  derivation 
may  involve  complex  estimation  techniques,  such  as 
residuals,  that  depend  on  the  whole  data  set. 

In  the  transition  from  microdatabases  to  tables 
the  manipulations  are  the  standard  statistical 
techniques — mostly  confirmatory  analysis.  Classi¬ 
fication  structures  for  tables  are  derived  from 
values in the microdata or created from the method
of analysis (for example, the standard ANOVA
table).

Finally,  we  have  manipulations  on  tables  that 
produce  other  tables.  These  manipulations  are 
mostly  exploratory  (since  tables  usually  contain 
aggregated  data).  Dolby  has  cataloged  manipula¬ 
tions  on  tables  in  three  groups:  select,  arrange, 
and  transform.  Here  is  where  classification  is 
most  important:  does  a  rearrangement  of  a  table's 
rows  violate  the  aggregation  structure?  Is  a 
particular  column  total  a  reasonable  aggregation? 
Is  a  subtraction  a  valid  comparison?  It  depends 
on  the  data  as  well  as  the  nature  of  the  manipu¬ 
lation. 

Summary 

The  general  theory  represented  by  the  Language 
of Data has a number of implications for the
communication of information, a documentation system
for  databases,  storage  and  retrieval  systems, 
computational  aids  for  analysis,  and  the  construc¬ 
tion  of  survey  instruments.  More  important,  it 
provides  a  general  framework  for  a  systematic 
approach  to  the  communication  of  information 
through  data. 


REFERENCES 

Dolby, James L.: Meaning from Data: Implications
for Data Analysis and Database Management Systems,
presented at the 149th National Meeting
of the American Association for the Advancement
of Science, Detroit, 1983.

Dolby, James L., and Nancy Clark: The Language of
Data, Language of Data Project, 1982.

Clark, Nancy: Tables and Graphs as Language,
presented at the 18th Interface Symposium, Fort
Collins, CO, 1986.

Rogers, William H.: Implications of the Language
of Data for Computing Systems, presented at the
18th Interface Symposium, Fort Collins, CO, 1986.




INTELLIGENT  DATA  MANAGEMENT 


Henson  Graves,  San  Jose  State  University 
Ruth  Manor,  San  Jose  State  University  and  Tel-Aviv  University 


Abstract 

Intelligent  computer  support  for  statistical 
data  analysis  requires  a  system  in  which 
descriptive  information  is  represented  and  used 
deductively  to  answer  questions  from  data, 
definitions and assumptions. The knowledge
representation requirements for supporting data
analysis include flexibility in interactively
introducing changes in the system and the
capability of handling data revision and data
discrepancy. We outline a formalism for
representing descriptive information and
auxiliary assumptions for data analysis. This
formalism  is  currently  being  developed  and 
implemented  in  the  Algos  computational  system. 

1.  Introduction 

Data  analysis  involves  a  variety  of 
activities  whose  results  are  communicated 
between  individuals  with  very  different 
perspectives.  Much  of  the  information  that  data 
analysts  use  will  only  be  available  on  a 
computer.  Computer  systems  are  used  to  perform 
analytic  operations  on  data,  and  serve  as  the 
medium  for  conveying  the  results  of  analysis. 

Computer  systems  used  in  data  analysis  such 
as  data  base  systems  and  statistical  packages  do 
not  keep  sufficient  information  to  support 
analysis.  The  data  analyst  has  to  obtain  (and 
remember)  what  definitions  and  conventions  were 
used  to  produce  the  data.  As  the  chains  of  new 
data  derived  from  old  data  become  longer,  it 
becomes  even  more  important  for  computer 
information  systems  to  carry  the  descriptive 
information  necessary  for  the  determination  of 
the meaning of data, and to use this information
in drawing conclusions from data and in
performing operations on data and tables.

This need for developing more intelligent
software systems to support data analysis is
matched by the existence of Artificial
Intelligence (AI) technology which can be used
to build such systems. However, current efforts
to build expert data analysis systems (Gale,
Pregibon, 1983, 1985; Portier, Lai, 1983;
Thisted, 1985a, 1985b), as well as discussions
of future development of intelligent software in
statistics (Hahn, 1985), focus on the expertise
of data analysts rather than on representing the
objects of analysis. Data analysis is a
relatively well defined activity, which makes
designing and building a representation system
supporting it extremely useful in solving
knowledge representation problems.

In this paper we describe the knowledge
representation requirements needed to support
data analysis, and how the system IDA (in
development) satisfies them (Graves, Manor,
1985). IDA makes use of the theory of data
description developed by the Language of Data
Project (Dolby, Clark, 1982), and it is
implemented in the knowledge representation
system Algos (Graves, Blaine, 1984, 1985).

IDA is not intended as an "expert" system in
the sense of knowing what data analysis
activities to perform. Rather, its intelligence
lies in its expressiveness and in the deductive
inference capability used to check applicability
conditions and to perform operations on the
objects of analysis (data and tables) using,
possibly, auxiliary information (e.g.,
assumptions and definitions).

2. Data Management Requirements

Computer support for data analysis requires
dealing with interesting knowledge representation
problems. One basic problem concerns
what are the primitive objects to represent,
i.e., what is a datum (Dolby, Clark, 1982).
Although discussions of data bases and AI
address some of the general issues of data
modeling, i.e., how to represent, they hardly
consider what primitive statistical entities
should be represented. Data analysis must deal
with the problems of data discrepancies and data
revision. These are species of the more general
problems of reasoning in the presence of
inconsistencies and temporal and context dependency,
discussed by logicians and by the AI community.

2.1 Statistical Micro Databases and Tables

Too much data with too little information
about what they represent is a rapidly growing
affliction of the information processing world.
Data travel through a chain of communication
that proceeds from the first steps of data
collection, through the processes of editing,
revision and data analysis, to presentation and
use -- with the concurrent need to store and
retrieve at each interface (Dolby, 1984a). Data
get seriously misinterpreted as a result of
specific ambiguities regarding who collected the
information, which event was measured, where and
when the event occurred, what was measured and
how it was measured. To correct this situation,
analysts must have systems that can process
descriptive and numerical information together.

Typically, data is gathered by distribution
of questionnaires and collected into a
micro-database. The data in the micro-database is
analyzed and presented in a summary form (tables
and graphs). Further analysis consists of
transformations  on  these  forms.  At  any  step  of 
this  sequence,  analysis  may  require  back¬ 
tracking,  revising  an  earlier  step  or 
restarting  at  an  earlier  step. 


Figure 1: A Micro-database

Distribution of energy to user sectors in the U.S.,
1977-80, as reported by the distributers
(Source: Department of Energy)

End-Use                              Amount
Sector        Distributer    Year    in Quad. BTU

Industrial    PG&E           1977    2.1*
Industrial    PG&E           1978    2.57
Industrial    PG&E           1979    3.02
Industrial    PG&E           1980    3.91
Industrial    So. Cal        1977    2.5*
Res&Com       PG&E           1977    1.96
Transport.

A micro-database (MDB) is a two-way matrix
(e.g., Fig. 1). Analysis of an MDB leads to a
table,  which  is  a  typical  vehicle  for 
communicating  and  manipulating  data.  Statistical 
tables are basic information units. Their
utility lies in their representation of
collections of facts about related phenomena,
arranged to make simple comparisons readily
apparent and to show that the applicability
conditions are met. For example, the MDB
presented in Fig. 1 might have served as a basis
of the table T1, Fig. 2 (Dolby, 1984c).

Figure 2: Table T1

U.S. CONSUMPTION OF ENERGY
BY END-USE SECTOR AND BY YEAR
(Quadrillion BTU's)

End-use sector                1977      1978      1979      1980

Industrial                  29.024    29.373    31.551    30.284
Residential & commercial    27.569    28.159    27.462    27.283
Transportation              19.735    20.612    19.950    18.628

Source: Dolby, 1984c

A  table  can  be  obtained  as  an  answer  to  a 
question,  e.g.,  "What  is  the  energy  consumption 
in  the  U.S.  in  1977-80,  by  end-use  sector  and  by 
year?"  and  it  is  used  to  make  comparisons  among 
data  and  to  study  how  comparisons  change  over 
time.  It  can  be  used  to  answer  questions  such 
as "What is the industrial energy consumption in
the U.S. in 1977?" by extracting from the table
that the industrial consumption of energy in the
U.S. in 1977 (in Quadrillion BTU's) is 29.024.
The table may also be used to answer the
question "What is the total end-use energy
consumption in the U.S. in 1977?". However,
computing the answer to this question correctly
requires inference on the basis of additional
assumptions.
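The difference between the two kinds of question can be sketched in Python. The values are those of table T1 (Fig. 2); the function names and the boolean flag standing in for the "additional assumptions" are illustrative and are not Algos syntax.

```python
# Table T1 (Fig. 2): U.S. consumption of energy in quadrillion BTU's.
T1 = {
    "industrial":               {1977: 29.024, 1978: 29.373, 1979: 31.551, 1980: 30.284},
    "residential & commercial": {1977: 27.569, 1978: 28.159, 1979: 27.462, 1980: 27.283},
    "transportation":           {1977: 19.735, 1978: 20.612, 1979: 19.950, 1980: 18.628},
}


def lookup(sector, year):
    """Direct extraction: needs nothing beyond the table itself."""
    return T1[sector][year]


def total_end_use(year, sectors_partition_end_use):
    """Totaling is valid only under the explicit assumption that the listed
    sectors form an exhaustive, non-overlapping partition of end use."""
    if not sectors_partition_end_use:
        raise ValueError("cannot total: partition assumption not granted")
    return round(sum(row[year] for row in T1.values()), 3)


print(lookup("industrial", 1977))
print(total_end_use(1977, sectors_partition_end_use=True))
```

Making the assumption an explicit argument keeps the total from being produced silently. A fuller system would derive the flag from stored classification information instead of taking it from the caller.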

We want a computer to be able to answer
questions like these (formulated, of course, in
the appropriate language) correctly. Hence, we
need to represent in the system the numeric
values (e.g., 29.024) together with their
associated descriptive information (i.e., that
the industrial consumption of energy in the U.S.
in 1977 (in Quadrillion BTU's, as reported by
Dolby 1984c) is 29.024). Moreover, for the
system to represent the meaning of the data
accurately, we need also to represent auxiliary
information (definitions, classification, etc.),
which may not be explicit in a table display.

For instance, the meaning of the table T1
depends on whether the years are defined as
calendar years starting on January 1, or
as fiscal years starting on April 1.

2.2 Expandability

What is needed to support the interactive
nature of data analysis is that the user be
able to add or change assumptions and
definitions (including definitions of data
operations) whenever he wishes. In designing
the system, we do not try to predict all the
possible definitions and assumptions a user may
need. An expandable system offers the user the
means to "engineer the changes in information".


2.3 Data Discrepancy and Data Revision

A  significant  activity  of  data  analysts 
consists  of  resolving  inconsistencies  in  the 
data  they  use.  A  data  analyst  is  constantly 
searching  for  additional  information  from  the 
data  at  hand,  and  discrepancies  in  the  data 
often  give  him  clues  about  where  to  look  for 
information. A typical cause of discrepancy in
data is the ambiguous use of terms whose meanings
change with contexts. A resolution of the
discrepancy consists of identifying the contexts
in which the terms should be interpreted
together with the interpretation. Whitmore
(1984) describes the following example: in 1979
there were headlines about how the Department of
Energy and the Bureau of the Census were
reporting different amounts of oil imported into
the country. The amount reported by the DoE was
higher by close to 7% than that reported by
Census. After tracking down the sources,
Whitmore concluded that "It turns out - for
completely legitimate reasons - these two
government agencies were using different
definitions of some elementary concepts. These
were 'oil', 'the United States' and 'month'."
Data statements and tables have sources. The
source can be viewed as identifying the context
in which statements should be interpreted.

Since agencies may occasionally revise the
definitions they use, and often revise their
data, it is important for the data analyst to be
able to find out not just what is the most
reliable information, but also the history of
the revisions. Hence, an intelligent system
should enable users to represent data revisions
and trace their evolution.

3.  Knowledge  Representation  in  Algos 

The basis for an intelligent system is a
computational system which can represent (user
specified) theories about some domain. To
represent the domain adequately, the theories
must have a rich language and deductive
capabilities. We employ the "logic" approach,
using the Algos computational system (Graves,
Blaine, 1983), which implements a deduction
system for a higher order function calculus.

Logical  languages  have  been  used  as  a 
paradigm  for  knowledge  representation  languages 
for  a  long  time  both  in  AI  (McCarthy,  Hayes, 
1969)  and  in  Database  theory  (Codd,  1970).  The 
traditional  formalization  of  the  Relational  Data 
Base  Model  represents  facts  as  sentences  in  a 
first  order  language.  Question  answering 
involves retrieving the answers from a data base
which  is  viewed  as  a  model  (in  the  logical 
sense)  of  the  language. 

Our  problem  was  to  find  a  logical  system 
which  is  sufficiently  expressive  for  data 
analysis. The language must represent entities,
relationships, data structures (such as records,
reports, and tables), as well as properties
of these objects. For analysis, the language
must represent algorithms or mathematical
functions. For reasoning, the language must
represent assumptions used in reasoning about
the domains (e.g., the assumption that a
function with Boolean arguments is an additive
measure). Such a language requires quantifying
over functions, and is, therefore, higher order.
We use a language of a higher order function
calculus which has been specifically engineered
to represent data structures and algorithms.

The language of Algos is based on topos
theory (Goldblatt, 1984) which is expressively
comparable to set theory and has been suggested
as a foundation for mathematics (Lawvere, 1976).
The difference between these theories lies in
their choice of primitives. Set theory is built
on  the  single  primitive  membership  relation. 
Topos  theory  is  built  on  the  primitive  notions 
of  function  (map)  and  type. 

3.1  Primitives  and  Definitions 

The  syntax  of  Algos  uses  elements  of 
mathematical  and  programming  language  notation. 
Algos  has  a  data  language  of  terms  which  are 
used  to  represent  the  various  kinds  of  data 
considered  here:  numeric  and  string  data, 
descriptive data, algorithms, and assumptions.
It has commands for introducing definitions and
assumptions, and for making inquiries. A
collection  of  definitions  and  assumptions 
constitutes  a  knowledge  base  (theory)  of  the 
system.  The  Algos  system  uses  deductive 
inference  to  answer  questions  on  the  basis  of 
the  definitions  and  assumptions  in  the  knowledge 
base.  The  model  of  computation  used  is  term 
evaluation  (simplification)  which  is  a  special 
case  of  deduction.  Simplification  uses 
deductive  inference  rules  in  the  form  of  term 
reduction  rules  which  correspond  to  the 
different  kinds  of  term  constructions  (e.g., 
tuple,  functional  abstraction,  etc.).  For 
example, the command

    simplify 2+2;

evaluates to '4'. A formula is a term which
evaluates to 'true' or 'false'. Thus, 'a > 0'
is a formula. We can declare a map, p, to be a
formula with the statement 'p : OMEGA'. If
'a' is undefined, the command

    simplify a > 0;

cannot be simplified and returns 'a > 0'.

Users can add names and definitions. For
example, we can enter definitions by

    density =df count/area

    months =df (jan feb march april may june
                july aug sep oct nov dec)

The  names  used  in  these  definitions,  'count', 
'area',  'jan',  'feb',  etc.,  need  not  have  been 
previously  defined  for  the  definitions  of 
'density'  and  'months'  to  be  legitimate. 

3.2  Some  Map  and  Type  Constructions 

Algos  has  a  product  type  construction  and  a 
corresponding  tuple  map  construction.  Products 
serve  as  record  types  and  tuples  serve  as 
records. For example, an employee record type
may be introduced with the declaration
'EMPLOYEE = product(NAME, AGE, SALARY)' and a map of record
type with the declaration 'a : EMPLOYEE'. We
use a tuple notation to specify a record:

    a =df <John, 34, 3232.22>

A  function  is  a  term  which  can  be  "applied" 
to  an  argument.  We  use  a  "lambda"  syntax  to 
specify  function  definitions.  For  example,  an 
algorithmic  definition  of  the  absolute  value 
function  can  be  expressed  by: 

    absolutevalue =df (lam x)(if x >= 0 then x
                              else -x)

This function can be applied to numeric
arguments. The command to simplify
'absolutevalue[-3]' results in the system using
the definition to return '3'. The type of the
function is the exponential type, 1 | X , where I
is the integer type.

In  addition,  Algos  has  lists,  numbers  and 
strings.  We  use  the  LISP  notation  for  lists, 
e.g.,  the  list  of  the  first  three  natural 
numbers is represented as '(0 1 2)'.

Algos has the empty list, 'nil', and the list
operations (as in LISP). The definition

    in =df (lam a,l)(if empty[l] then false
             else (if a = first[l] then true
             else in[a, rest[l]]))

tests if an element is a member of a list, so

    simplify in[3, (1 2 3)];

evaluates to 'true'.

Algos  has  a  power  type  construction  which  is 
used  to  represent  relations.  We  use  'POW(X)' 
to  represent  the  power  type  of  a  type  X. 

Relations  on  a  type  correspond  to  subtypes. 
Further,  a  relation  corresponds  to  a  formula. 

For  example,  we  can  introduce  a  function  to 
represent 'is a male' with the statement 'male :
HUMAN -> OMEGA'. The formula 'male' determines
a relation '(male)' which has the value type
POW(HUMAN). This relation represents the data
elements of HUMAN having the property of being
males.

Many  data  operations  involve  aggregating  the 
values  of  measurements  of  parts  of  some  domain. 

We  use  relations  to  represent  the  "containment" 
and  partition  relationships.  For  each  type 
there are zero (empty) and unit relations, and
the Boolean operations '+', '*' and '-' (union,
intersection and complement, respectively).
E.g., if 'a' is a relation, then the command to
simplify 'a+0' returns 'a'. The definition

    sum =df (lam l)(if empty[l] then 0
              else first[l]+sum[rest[l]])

provides a function to total lists of relations.

In order to represent validity conditions for
totaling lists we need to represent partitions.
Informally, a list (a1 ... an) partitions a if
both 'a = sum[(a1 ... an)]' and
'for any b,c in (a1 ... an) not(b = 0) and
(if not b=c then b*c = 0)' evaluate to 'true'.
Partition is defined as the test:

    partition =df (lam e,l)(e = sum[l] and
        (any b)(if in[b,l] then not(b = 0)) and
        (any b,c)(if in[b,l] and in[c,l] and not(b = c)
                  then b*c = 0))

An additive measure is a function whose
values are additive over partitions, defined:

    measure =df (lam m)(lam e,l)(partition[e,l]
        implies m[e] = sum[map[m,l]]).
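For readers more familiar with sets than with Algos relations, the two tests can be sketched in Python, with a `frozenset` standing in for a relation, union for '+', and intersection for '*'. This is an analogy under those assumptions, not the Algos implementation, and the function names are hypothetical. Unlike the informal definition, the sketch treats a repeated list element as an overlap.

```python
from functools import reduce


def union_all(parts):
    """Analogue of 'sum' over a list of relations: the union of all parts."""
    return reduce(frozenset.union, parts, frozenset())


def is_partition(e, parts):
    """True if `parts` are non-empty, pairwise disjoint, and their union is `e`."""
    covers = union_all(parts) == e
    non_empty = all(len(b) > 0 for b in parts)
    disjoint = all(not (b & c)
                   for i, b in enumerate(parts)
                   for c in parts[i + 1:])
    return covers and non_empty and disjoint


def is_additive_on(m, e, parts):
    """Check the measure condition for one particular partition of `e`:
    if `parts` partitions `e`, then m(e) must equal the sum of m over parts."""
    return (not is_partition(e, parts)) or m(e) == sum(m(b) for b in parts)


e = frozenset(range(6))
parts = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5})]
print(is_partition(e, parts))
print(is_additive_on(len, e, parts))                    # counting is additive
print(is_additive_on(lambda s: len(s) ** 2, e, parts))  # squared count is not
```

A function can only be shown additive on the partitions actually tested, which is why the paper treats 'measure[m]' as an assumption to be entered and not as something the system computes.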

3.3  Assumptions 

We  distinguish  between  a  formula  as  a  data 
object,  and  assuming  it.  Having  the  formula  as 
an  assumption  means  that  we  can  use  it  in 
deduction.  For  example,  entering  the  commands 

    assume a > 0;
    simplify absolutevalue[a];

returns 'a', because the evaluation of
'absolutevalue[a]' utilizes the assumption that
a > 0. The user may want to see the assumptions
which were used in the simplification. To
obtain this, the simplify command followed by '?'
will result in displaying the assumptions used:

    simplify absolutevalue[a]?

returns 'a; depending on a > 0'.
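The idea of returning a result together with the assumptions it depends on can be illustrated with a small Python sketch. It handles only the absolute value case, represents assumptions as plain strings, and is not how Algos performs deduction; the function name is hypothetical.

```python
def simplify_abs(name, assumptions):
    """Simplify absolutevalue[name] symbolically.

    `assumptions` is a set of strings such as "a > 0". Returns the
    simplified term and the list of assumptions the result depended on.
    """
    for fact, result in ((f"{name} > 0", name),
                         (f"{name} >= 0", name),
                         (f"{name} < 0", f"-{name}")):
        if fact in assumptions:
            return result, [fact]
    # Nothing is known about the sign: return the term unsimplified.
    return f"absolutevalue[{name}]", []


print(simplify_abs("a", {"a > 0"}))
print(simplify_abs("a", set()))
```

Carrying the dependency list alongside the result is what allows a later step to report, or to retract, conclusions when an assumption is revised.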

In  data  analysis  one  often  has  only  partial 
information  about  a  function,  such  as  knowing 
its  values  for  specific  arguments  without  having 
an algorithmic definition. We represent such
information with commands of the form

    assume f[a] = k;

The equality symbol in this context
represents the simplification relation. The
command to simplify 'f[a]' results in 'k'.

The statement "the kinds of coal are
anthracite, lignite, and bituminous" is
represented as an assumption. We represent
'kind' as a function which when applied to
'coal' yields a list. Namely,

    assume kind[coal] = (anth lign bitum);

Note that the terms in this formula, such as
'kind' and 'coal', may not be defined. The
question "what kinds of coal are there?" is
represented as the request
to simplify 'kind[coal]', which simplifies
to the list '(anth lign bitum)'.

For  data  analysis  it  will  be  important  for  us 
to  total  numerical  lists  on  the  basis  of  the 
assumption  that  the  list  represents  the  values 
of  an  additive  measure  over  a  partition.  For 
example,  on  the  basis  of: 

    assume measure[m];
    assume partition[a, (a1 a2 a3)];
    assume m[(a1 a2 a3)] = (1 4 2);

the command to simplify 'm[a]' returns '7', as a
result of the inference:

    m[a] = sum[map[m, (a1 a2 a3)]]
         = sum[(1 4 2)]
         = 7

3.4 Context Dependency

Knowledge  representation  for  data  analysis  is 
dynamic.  Objects  get  added  and  their  relation¬ 
ships  change  during  the  span  of  existence  of  the 
system.  In  order  to  represent  reasoning  about 
the  dynamic  aspect  of  knowledge  acquisition,  we 
need  to  represent  the  context  dependency  of 
information. Context dependency includes
temporal and source relations. To require that
all contextual dependencies be explicitly stated
by the user, or that all assumptions used in a
deduction come from the same source, is too strong.
A user may not be aware of what these are, and he
may sometimes want to use information from
different sources anyway.

By  representing  context  dependency  we  mean 
that  the  user  may  choose  to  specify  contextual 
dependencies explicitly, or he may choose to
omit them. In the latter case the system will record a
contextual reference, which may be recovered at
will. For instance, the system can record the
time the definition or assumption was entered,
the terminal used, etc. Similarly, in deductive
inference, one may choose to ignore the fact
that assumptions used have different contextual
dependencies (e.g., we may want to ignore the
fact that definitions were given by different
sources, or entered at different times), or we
may want to specify that only assumptions of a
specified source should be used. A typical case
of discrepancy arises when we ignore contextual
dependencies, derive a contradictory conclusion
and then try to resolve it by restricting the
context of the assumptions.

In Algos, commands which modify the knowledge
base index the new information by their "origin",
which is the time and source of the statement.
For example, consider the commands

    a =df 5;
    a =df 6;

and let t1 and t2 be the times at which these
commands are executed. We view 'a' as one
entity with two values at the two times. Thus,

    simplify a/t1;

evaluates to '5', while

    simplify a/t2;

evaluates to '6'. Similarly, we may specify the
source of a definition. E.g., the commands

    a =df 3/DOE;
    a =df 4/Census;

specify values for a for each of the two
sources, DOE and Census.
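Origin indexing of this kind can be sketched as a small Python class. The class name, its methods, and the use of a counter in place of real timestamps are assumptions made for illustration. In particular, reading 'a/t' as "the latest value as of time t" is one possible interpretation.

```python
import itertools


class KnowledgeBase:
    """Stores every definition together with its origin (time and source)."""

    def __init__(self):
        self._entries = {}                # name -> list of (time, source, value)
        self._clock = itertools.count(1)  # stand-in for a real timestamp

    def define(self, name, value, source=None):
        """Record a definition with its origin and return its time."""
        t = next(self._clock)
        self._entries.setdefault(name, []).append((t, source, value))
        return t

    def value(self, name, at=None, source=None):
        """Latest value of `name`, optionally restricted by time or source."""
        matches = [v for (t, s, v) in self._entries.get(name, [])
                   if (at is None or t <= at)
                   and (source is None or s == source)]
        if not matches:
            raise KeyError(name)
        return matches[-1]

    def history(self, name):
        """All recorded (time, source, value) entries, oldest first."""
        return list(self._entries.get(name, []))


kb = KnowledgeBase()
t1 = kb.define("a", 5)
t2 = kb.define("a", 6)
print(kb.value("a", at=t1), kb.value("a", at=t2))

kb.define("b", 3, source="DOE")
kb.define("b", 4, source="Census")
print(kb.value("b", source="DOE"), kb.value("b", source="Census"))
```

Because nothing is overwritten, the revision history asked for in Section 2.3 comes for free, and a discrepancy between sources can be resolved by restricting a query to one source.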

4.  Satisfying  the  requirements  in  Algos 

Statistical  tables  are  viewed  as  linguistic 
entities.  The  language  of  data  is,  however,  a 
very  restricted  linguistic  context  in  that  data 
statements, in general, represent the values of
measurements of some observed "chunk of
reality". The enterprise attempted here is to
represent these notions within a formal
deduction system. We answer questions using
information expressed in data, and in
assumptions and definitions. However, as with
any language, sentences may mean different
things to different individuals. Successful
communication depends on consensus of meaning
regarding the descriptive terms of the data
sentences. The LOD approach to this problem is to
choose classes of descriptors and index their
meanings by context. This theory represents
statistical  data  in  terms  of  an  explicit  set  of 
descriptors.  This  appears  to  be  adequate  for 
representing  a  large  class  of  data. 

4.1  Statistical  Data 

Following  Dolby,  the  basic  communicated 
entity  is  a  datum,  which  is  a  statement 
concerning  the  value  of  a  measurement.  The 
measurement is associated with an observed
"event", and the descriptive information in the
sentence identifies the event, the object
measured, how it was measured and the result of
the measurement. Dolby has suggested that the
descriptive part of statistical data statements
has the following components: Aspect (of
measurement), Object (of measurement), Unit (of
measurement), Event observed, Observer, Matter,
Activity, Space, Time, and Source.

Information  lies  in  the  comparison  of  related 
data,  and  not  in  singular  data  (Dolby,  Clark, 
1982).  One  of  the  problems  we  face,  therefore, 
is  to  characterize  ways  in  which  data  are 
related,  in  order  to  validate  data  comparisons. 
Analysis of examples shows that it is the
descriptions  and,  specifically,  the  event 
descriptors,  which  serve  to  check  data 
comparability,  and  we  need  to  indicate  how. 

Consider  for  instance  the  descriptions:  "the 
amount  of  total  end-user  consumption  in  the  U.S. 
in  1980  in  quadrillion  BTU's  (as  reported  by 
MER)",  and  "the  amount  of  industrial  consumption 
in  the  U.S.  in  1980  in  quadrillion  BTU's  (as 
reported by CTl)". We need to represent these
descriptions  in  such  a  way  that  their 
comparability  can  be  verified.  The  question  of 
characterizing  the  "relatedness"  of  data  can  be 
reduced  to  questions  regarding  relations  between 
the  corresponding  components  of  the  descriptors. 
For example, table T1 (Fig. 2) displays data
which  share  all  their  descriptive  components 
except  those  in  the  observer  and  time  slots. 
Similarly,  the  comparability  of  the  data  quoted 
above  follows  from  the  fact  that  they  differ 
only  in  their  users.  However,  as  we  have  noted, 
different  sources  often  use  the  same  terms  in 
different ways. Hence, even in this simple
case,  elementary  parsing  is  insufficient  and  we 
need  also  to  check  that  the  terms  used  in  these 
descriptors,  e.g.,  'consumption',  do  indeed 
carry  the  same  meaning. 
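
The slot-by-slot comparison described above can be sketched as follows. This is a minimal illustration in Python, not Algos code: the class name, the slot names and the function `differing_slots` are assumptions introduced here, and the event slot is omitted for brevity. The slot values are those of the two descriptions quoted above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Descriptor:
    # Slot names follow Dolby's components; the names themselves are illustrative.
    aspect: str
    obj: str
    unit: str
    observer: str
    matter: str
    activity: str
    space: str
    time: str
    source: str


def differing_slots(a, b):
    """Return the names of the descriptor slots in which a and b differ."""
    return [f.name for f in fields(Descriptor)
            if getattr(a, f.name) != getattr(b, f.name)]


total = Descriptor("amount", "energy", "q-BTU", "all-end-users",
                   "energy", "consumption", "U.S.", "1980", "MER")
industrial = Descriptor("amount", "energy", "q-BTU", "industrial",
                        "energy", "consumption", "U.S.", "1980", "CTl")

print(differing_slots(total, industrial))
```

The comparison also reports the source slot, which is where the caveat about meaning applies: matching strings such as 'consumption' from two sources does not by itself establish that the terms are used in the same way.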

Following Dolby (1984b), we represent the
datum as the assumption that the value obtained by
measuring the amount of energy
associated with the observed event is 76.201.
Amount (aspect of measurement) is represented as
a function whose arguments are: energy (object
measured), q-BTU (unit of measure), and the
event observed. We can represent the above
datum as the following assumption, which may be
entered into the Algos system with

assume amount[energy, q-BTU,
    <all-end-users, energy, consumption,
     U.S., 1980, MER>] = 76.201;

The  representation  of  MDB  exploits  our  basic 
representation  of  a  statistical  datum.  Namely, 
the  MDB  presented  in  Fig.  1  is  viewed  as  a 
collection  of  statistical  data,  and  is 
represented  as  a  function  in  the  argument 
<energy, q-btu,
 <<distributer, sector>,
  energy, distribution, U.S., year, DOE>>.

The  MDB  is  represented  as  an  assumption  about 
the function 'amount'. Thus

assume amount[energy, q-btu,
    <<distributer[energy], sector[end-user]>,
     energy, distribution,
     U.S., year[1977-1980], DOE>]
  = (2.14, 2.57, ...);

Tables  are  viewed  as  displays  of  (comparable) 
data,  and  are  represented  in  terms  of  lists  of 
components.  The  information  in  the  list  is 
sufficient  to  support  operation  on  data.  Table 
T1  (Fig.  2)  describing  the  "U.S.  consumption  of 
energy  in  1977-80  by  end-use  sector  and  year"  is 
represented  as  the  list 

(<end-user, energy, consumption, US, 1977-80, MER>
 <amount, energy, q-BTU>
 end-user
 (industrial res&com transportation)
 1977-80
 (1977 1978 1979 1980)
 (29.024 29.373 ...))

4.2 Dependency, Inconsistency and Revision

An Algos theory is a collection of
definitions  and  assumptions  from  a  knowledge 
base.  Users  are  free  to  enter  new  assumptions 
into  the  theory.  The  Algos  language  is 
sufficiently  rich  to  express  the  context 
dependencies  of  these  assumptions.  This  has 
been  demonstrated  for  single  dependencies 
(Graves,  Manor,  1985).  Our  future  research  will 
concentrate  on  inconsistencies  and  revision  by 
distinguishing  and  representing  multiple 
theories  along  the  following  lines. 

Context  dependency  can  be  viewed  as  a  means 
for identifying a (sub)theory, in that all
entities  and  statements  restricted  to  a  context 
determine  a  theory.  Of  course,  since  there  is 
practically  no  restriction  on  contexts,  there 
are  no  restrictions  on  what  sets  of  statements 
could  make  up  a  theory.  In  the  Algos  system 
inconsistency  can  only  arise  as  the  result  of 
assumptions introduced by a user. When a
contradiction is detected, the computation steps
can be examined to locate what assumptions were
used, and the contradiction may then be resolved
by  introducing  further  contextual  distinctions 
in  the  contradictory  hypotheses.  Thus,  we  can 
consider  the  contradictory  hypotheses  as 
creating  alternative  theories.  Moreover,  we  can 
consider  a  theory  identified  by  some  context 
(e.g.,  all  assumptions  associated  with  the  DOE 
source)  and  trace  its  evolution  through  time. 

A  data  analysis  system  will  use  a  number  of 
primary  external  data  sources  such  as  the  DOE 
and  Census.  These  sources  use  relatively  stable 
definitions  of  their  terms  and  stable 
assumptions  about  them.  However,  they  may 
change over time. The terms occurring in a table
whose origin is some bureaucratic entity such
as DOE are indexed by that source. Each source
entity such as DOE has a collection of
definitions and assumptions indexed by it.

The discrepancies discussed in Section 2 can
be resolved by adding the following assumptions,
in which contextual dependency is explicit:

assume sum[...] = ... /CTl;
assume (state or district)[US] =
    (alabama ...) /census;
assume day[month] =
    (15[month] ... 14[month+1]) /census;
assume day[month] =
    (first[month] ... last[month]) /DOE;

The strategy for resolving inconsistencies in
the data is based on the principle that any
non-identical entities should have different
names. We attempt to resolve inconsistencies by
disambiguating the descriptions. Data are
usually aggregations over partitions. A first
step is to look, in the conflicting data, for the
partition used, and to identify the terms used
to describe the aggregation and its components.
Typically, a discrepancy arises by using
different partitions of the same aggregation.

In  these  cases  we  distinguish  them  by  reference 
to  the  context  as  above. 
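
As an illustration of indexing a definition by its source, the two day[month] assumptions above can be sketched in Python. This is a hedged reading, not Algos code: it assumes that the census month runs from the 15th of one month to the 14th of the next and that the DOE month is the calendar month, and the function and table names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def month_days_doe(year, month):
    """Calendar month: first to last day (assumed reading of the /DOE assumption)."""
    last = calendar.monthrange(year, month)[1]
    return [date(year, month, d) for d in range(1, last + 1)]


def month_days_census(year, month):
    """15th of this month to 14th of the next (assumed reading of /census)."""
    start = date(year, month, 15)
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    end = date(next_year, next_month, 14)
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


# The same term, "days of a month", indexed by the source that defines it.
DAYS_OF_MONTH = {"DOE": month_days_doe, "census": month_days_census}


def days(year, month, source):
    return DAYS_OF_MONTH[source](year, month)
```

Because both definitions stay in the table under different context indices, neither has to be deleted to remove the apparent contradiction.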

In  Algos  there  are  no  restrictions  on  what 
could  serve  as  a  context  dependency,  although  in 
data analysis the dependencies of interest are
relative to time and source. Given appropriate
temporal assumptions, we can consider a theory
identified by some source (e.g., all assumptions
associated with the DOE) and trace their
evolution,  compare  conclusions  based  on 
different  assumptions,  and  decide  on  their 
relative  reliability,  without  having  to  delete 
any  information  from  the  knowledge  base. 
A Front-End for GLIM

J.A.  Nelder  and  D.  Wolstenholme,  Imperial  College. 


1. INTRODUCTION

GLIM  is  a  package  built  round  an  algorithm 
for fitting generalised linear models (GLMs)
(McCullagh and Nelder, 1983). It is currently
distributed to more than 900 sites in 50 countries.
It  has  in  addition  facilities  for  data  handling, 
tabulation,  scatter  plots  and  histograms, 
and  an  interpretive  language  with  control 
structures  for  branching  and  looping.  For 
a  full  description  see  Payne  et  al  (1986). 

Like  most  current  statistical  packages  GLIM 
assumes  that  the  user  knows  how  to  do  an  analysis, 
and  provides  him  with  easily-used  tools  for  doing 
it.  This  paper  describes  a  knowledge-based 
front end (KBFE) for GLIM, currently under construction,
which embodies statistical expertise
to aid the user in the choice of models for his
analysis. The front-end will not be a black
box, delivering the 'correct' analysis to the
user, and requiring from him little more than a
description of his data. Our aims for the KBFE
are  three-fold: 

(i) To give good advice to the user on the
analysis.

(ii) To do so in a way that encourages the
user to do better next time.

(iii) To satisfy the requirements of a range
of user skills.

The front-end will be one with fixed rules,
i.e. it will not be intelligent in the sense of
Student (Gale, 1985), which learns from its experience
of users and modifies its rules accordingly.
It will be an expert system in the
sense  that  it  contains  rules  which  encapsulate 
expertise. 

2. TOOLS

The  front-end  is  being  written  in  Prolog,  a 
declarative  language  for  logic  programming.  GLIM 
is  the  algorithmic  engine  for  the  system,  and 
the  Prolog  being  used  has  its  own  front-end  in 
the  form  of  APES  (see  Section  5.1).  Communication 
between  the  front-end  and  GLIM  is  controlled  by 
Unix,  which  is  the  operating  system  for  the 
development.

2.1  GLIM 

We use GLIM 3.77, which is written in restricted
Fortran 77. The interpretive language
has statements in free format, each beginning with
a directive name of the form $letters, e.g. $FIT,
$SORT etc. These are followed by zero or more
arguments, usually scalars or vectors; however,
$CALCULATE allows expressions with vectors and
scalars as operands, and the $FIT directive uses
a model formula for the linear component of the
model (Wilkinson and Rogers, 1973). A set of
statements may be named as a macro, and the directives
$LOOP and $SWITCH with macros as arguments
allow looping and branching. Subfiles may contain
any mixture of data, statements to be
executed and macros. The program may be dumped
at any stage, and its current state restored
later.

2.2 Sigma-PROLOG

PROLOG  is  the  most  widely-used  logic  programming 
language  (Hogger,  1984).  Its  basis  is  a  subset 
of  first-order  predicate  logic,  with  certain 
extensions. Using PROLOG, the programmer may
describe the logical structure of a problem
directly instead of being forced, as with conventional
procedural languages, to describe in
detail the steps the computer must take to solve
the problem. This makes PROLOG a good tool for
expressing  knowledge,  since  the  knowledge  can 
stand  alone,  uncluttered  by  computer  control 
instructions. 

A typical logic, or PROLOG, program describing
knowledge about why a car might fail to start is:

fails-to-start (_any-car) if
    has-flat-battery (_any-car)

fails-to-start (_any-car) if
    not has-petrol (_any-car)

has-petrol (my-car)

has-flat-battery (my-car)

where  an  underscore  symbol  at  the  beginning  of 
a  word,  as  in  _any-car,  indicates  that  the  word 
is  a  variable. 


The  PROLOG  interpreter  can  use  this  program  to 
solve  certain  problems,  such  as 

fails-to-start  (my-car)? 

using the necessary inference rules. PROLOG may
also be seen as a high-level procedural programming
language, since it employs an inference
mechanism known as 'resolution', whereby the
first rule shown may be seen as having the procedural
reading: to show that a car fails-to-start,
show that the car has-flat-battery.
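
The procedural reading can be made concrete with a small Python sketch that evaluates the car program above. It is an illustration only, not how sigma-PROLOG works internally: it handles single-argument predicates without variables or unification, treats 'not' as negation as failure, and the names `FACTS`, `RULES` and `holds` are assumptions introduced here.

```python
# Facts of the car program above.
FACTS = {("has-petrol", "my-car"), ("has-flat-battery", "my-car")}

# Each head predicate maps to a list of alternative rule bodies. A body is a
# list of (negated, predicate) conditions applied to the same argument.
RULES = {
    "fails-to-start": [
        [(False, "has-flat-battery")],
        [(True, "has-petrol")],
    ],
}


def holds(pred, arg):
    """To show pred(arg): find it among the facts, or show some rule body."""
    if (pred, arg) in FACTS:
        return True
    for body in RULES.get(pred, []):
        # A negated condition succeeds exactly when the condition cannot be shown.
        if all(holds(p, arg) != negated for negated, p in body):
            return True
    return False


print(holds("fails-to-start", "my-car"))
```

For a car about which nothing is recorded, the second rule succeeds because has-petrol cannot be shown, which is the closed-world behaviour of negation as failure.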

LPA  sigma-PROLOG  is  a  dialect  of  PROLOG,  written 
in  C,  which  is  suitable  for  use  on  machines  that 
support  UNIX,  such  as  the  VAX  11/750  used  in 
this  project.  It  is  a  low-level  implementation, 
with  a  LISP-like  syntax,  suitable  for  systems 
development.  It  may  have  front  ends,  such  as 
APES,  added  to  it,  to  provide  additional  features 
or alternative syntaxes. Thus the first rule given
above, in standard sigma-PROLOG syntax, is

((fails-to-start _any-car)
 (has-flat-battery _any-car))

Sigma-PROLOG's most useful features for the development
of the GLIM KBFE are modules and the
FORK  primitive  which  permits  the  spawning  of  child 
processes  in  UNIX,  e.g.  FORTRAN  programs  such  as 
GLIM.  Communication  with  these  processes  is  carried 
out  simply  using  UNIX  pipes  as  shown  in  Figure  1. 
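
The pattern of spawning a child process and talking to it over pipes can be sketched in Python with the standard `subprocess` module. This is an analogy rather than the sigma-PROLOG FORK primitive: the child here is a stand-in script that echoes each line it receives, not GLIM, and the helper names `start_child`, `send` and `stop_child` are hypothetical.

```python
import subprocess
import sys

# Stand-in child program: echoes every input line back, prefixed with "ok: ".
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('ok: ' + line.strip(), flush=True)\n"
)


def start_child():
    """Spawn the child with one pipe for its input and one for its output."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def send(proc, command):
    """Write one command line to the child and read one line of reply."""
    proc.stdin.write(command + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")


def stop_child(proc):
    """Close the pipes and wait for the child to exit."""
    proc.stdin.close()
    proc.stdout.close()
    proc.wait()
```

A front-end built this way keeps the numerical engine as a separate program, so the engine needs no modification to be driven by the controlling process.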
3. GENERAL PROPERTIES OF THE FRONT-END

There  are  two  general  characteristics  of  the 
front-end which deserve discussion.

3.1 Transparency

An important property of Unix is its transparency,
by which we mean that all the tools
available to the operating system are also available
to the user. Compare this with the older-style
operating systems where, for instance, tools
for  creating  and  searching  directories  would 
certainly  have  been  created  by  the  originators  but 
would not be accessible by the user for his own
purposes. We think that expert systems should
be similarly transparent, so that if the system
uses  certain  procedures  to  obtain  information  on 
which  to  base  advice,  the  user  should  have  access 
to  the  same  procedures  for  his  own  activities. 

We believe that transparency will aid the transfer
of expertise and encourage the user to learn
from that expertise; at the same time we recognize
that the user who does not wish to think for
himself,  but  rather  hankers  after  a  black  box, 
may  not  like  the  system.  We  do  not  aim  to  cater 
for  this  class  of  user. 


mental or observational), etc., designed
to establish whether GLIM is
suitable  and  to  give  guidance  about 
forms  of  analysis  likely  to  be 
appropriate. 

DV data validation
   - detection of gross errors and inconsistencies.

DE data exploration
   - mainly graphical techniques, to
     determine possible transformations
     and initial settings of link and
     variance functions.
3.2 Libertarianism

An  authoritarian  system  is  one  that  controls 
the  sequence  of  operations  for  a  user,  in  the 
sense  that  if  a  certain  state  is  reached,  then 
other  future,  and  hitherto  possible,  options  may 
now  be  barred.  Thus  it  might  be  that  if  in  a 
simple  linear  regression  a  quadratic  term  is 
found to be significant then future actions involving
the use of the linear fit only would be
barred.  An  authoritarian  system  is  one  in  which 
the  system  knows  best  -  always.  The  alternative, 
which  we  favour,  allows  that  the  user  may  have 
background  knowledge  that  the  system  does  not 
know  about  and  cannot  easily  discover.  There  is 
a  pragmatic  argument  for  libertarian  systems  in 
statistics,  quite  apart  from  more  philosophical 
ones. This is that the rules for such systems
are  themselves  abstractions  from  whatever  fields 
of  application  the  originators  know  about,  and 
that  particular  knowledge  of  those  fields  cannot 
be  part  of  the  rule  system;  thus  the  user  will 
always  bring  background  knowledge  to  his  analysis 
which must be given full opportunity for expression.

4. LARGE-SCALE STRUCTURE

Prolog  per  se  is  too  low-level  a  language  to 
use in an unstructured way for constructing a KBFE.
Thus  higher-level  structures  must  be  developed 
for  expressing  the  large-scale  structure.  In 
addition a general facility is needed for time-stamping
information. In Prolog an assertion
once made stays true; such a property is not
suitable for a system involving trial-and-error
learning. The method of time-stamping information
is described in Section 4.3. Another
feature of Prolog that needs attention is the
assumption  of  negation  as  failure;  this  says 
that  a  fact  that  cannot  be  established  to  be  true 
is  taken  as  false,  i.e. 

not  established  to  be  true  =  established 
not  to  be  true 

This  closed-world  assumption  must  also  be  modified 
by  use  of  a  predicate  defining  'established'  if 
trial-and-error  learning  is  to  be  correctly 
modelled. 
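
The difference between 'not established to be true' and 'established not to be true' can be sketched in Python as follows. This is an illustration under assumptions, not the front-end's actual predicate: the fact names and the functions `naive_not`, `status` and `established_not` are hypothetical.

```python
# Facts whose truth value has actually been established during the session.
KNOWN = {"response-is-count": True, "has-outliers": False}


def naive_not(fact):
    """Negation as failure: anything not known to be true counts as false."""
    return not KNOWN.get(fact, False)


def status(fact):
    """Three-valued view: 'true', 'false' or 'unknown'."""
    if fact not in KNOWN:
        return "unknown"
    return "true" if KNOWN[fact] else "false"


def established_not(fact):
    """True only when the fact has been established to be false."""
    return fact in KNOWN and KNOWN[fact] is False


print(naive_not("link-is-log"), established_not("link-is-log"))
```

Under negation as failure the unexamined fact 'link-is-log' is treated as false, whereas the three-valued view leaves it open until it has been investigated.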

4.1 Nodes

The  analysis  process  has  been  split  into  a  set 
of  activities,  each  defining  a  node.  These,  with 
their  two  letter  abbreviations  and  brief  indication 
of  scope,  are: 

SE set-up user environment

DI data input
   - get data, with names of variables,
     for analysis.

DD data definition

MS model selection
   - procedure for selecting one or more
     parsimonious models for the data.

MD model display
   - display of statistics, e.g. fitted
     values and residuals, associated
     with models selected.

MC model checking
   - checking the adequacy of models
     selected using various techniques.

MP model prediction
   - summarising results of models found,
     including calculation of summary
     statistics and measures of
     uncertainty.

The  nodes  can  be  thought  of  as  the  nodes  of  a 
graph  and  the  strategy  of  an  analysis  may  be 
summarized  by  the  particular  path  taken  through 
the  graph,  together  with  variables  defining  the 
state  at  each  node.  The  path  will  reflect  the 
user’s  choice  of  methods,  previous  analyses  and 
prior  knowledge  of  the  data  set. 

4.2 Tasks

A  new  command  language  is  being  developed  to 
allow  various  tasks  to  be  invoked  by  high-level 
commands.  These  tasks  may  be  broadly  divided 
into  two  categories: 

(i)  those  providing  general  facilities 

such  as 

access  to  the  operating  system, 
direct  access  to  GLIM 
background  information 
changing  activity 
changing  mode  of  use;  e.g.  from 
giving  tasks  directly  to 
obtaining  advice 
quitting  the  system; 

(ii) those concerned with the details of
statistical analysis using the front-end,
such as

inputting  data 
creating  analytical  trees 
finding  basic  statistics 
plotting  graphs. 

The  syntax  for  invoking  both  categories  of  task 
is  the  same.  Each  task  is  invoked  by  a  keyword, 
e.g.  'find',  'create',  etc.,  possibly  followed  by 
a  sequence  of  keywords  and  variables.  The  syntax 
of each task is designed to facilitate both checking
and prompting, since a given keyword uniquely
determines  the  sequence  of  variables  following  it, 
together  with  the  set  of  keywords  from  which  the 
next  keyword  can  come. 

When  the  user  specifies  a  task  to  be  done,  a 


user  provides  information  about 
variables (e.g. whether continuous
or counts), the data
structure  (i.e.  whether  experi- 


four-level checking process is first undertaken:

(i) Pattern matching - the first syntax
check - whereby the task specified is
matched  against  the  syntax  of  possible 
tasks. Failure to match results in
prompts  which  tell  the  user  how  much  of 
the  task  specified,  working  from  left  to 
right,  matches  a  possible  task,  and  how 
this  might  be  completed. 

For  example,  the  three  tasks  available  for 
entering  data  from  the  keyboard  take  the 
forms: 

keydata  vector  _name  _values 
keydata  rows  _names  _values 
keydata  columns  _names  _values. 

If the user specified task

keydata (A B C) (1 2 3 4 5 5 6 8 7)

i.e. with the second word missing, no
pattern match would be found, and the
following message would appear.

the first part of the task given
    keydata
may be completed as follows
    vector <name of vector>
        <bracketed list of values of vector>
    rows <bracketed list of vectors>
        <bracketed list of values of given vectors>
    columns <bracketed list of vectors>
        <bracketed values of given vectors>.


(ii)  Basic  type  checking  -  the  second  syntax 
check.  Following  a  successful  pattern 
match,  the  type  of  each  variable,  if  any, 
is  checked. 

For  example,  if  the  user  specified  the 
task 

keydata rows (A B C) (1 2 3 4 5 5 6 8 7)
the list (A B C) would be checked to en¬


when  GLIM  executes  commands.  Therefore,  the  GLIM 
output  from  any  command  that  might  cause  error  is 
checked  for  error  messages.  Since  errors  can 
leave  the  state  of  GLIM  in  an  unknown,  inter¬ 
mediate  state,  detection  of  an  error  leads  to 
restoration  of  the  previous  state. 

4.3  Time  stamping 

Both GLIM and the user are sources of information
for the front-end. During an analysis,
however, the state of the GLIM environment changes.
Similarly,  the  user's  mind  may  also  be  considered 
to  change  'state'  as  knowledge  is  acquired  and 
actions  are  carried  out.  Information  found  from 
either  of  these  sources  cannot  therefore  be 
assumed  to  remain  valid  throughout  a  session. 

A  simple  solution  is  to  assume  that  an  answer 
holds  true  only  when  asked:  this  solution  would, 
however,  be  unacceptable  since  the  user  might 
then  be  asked  the  same  question  many  times  over. 
Instead,  all  information  found  from  the  user  or 
GLIM  is  time-stamped  so  that  the  interpreter  can 
later  'decide'  whether  or  not  such  information 
still  holds  true  following  certain  actions.  Only 
if  its  validity  is  doubtful  will  the  user  or  GLIM 
be  re-queried. 
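
A minimal Python sketch of this time-stamping idea follows. It is an assumption-laden illustration rather than the front-end's mechanism: the class `StampedStore`, the notion of named 'topics' that actions invalidate, and the `ask` callback standing in for a query to the user or to GLIM are all introduced here.

```python
class StampedStore:
    """Cache answers with the time they were found; re-query only when stale."""

    def __init__(self):
        self.clock = 0
        self.answers = {}      # question -> (answer, time it was found)
        self.last_change = {}  # topic -> time of the last action affecting it

    def act(self, topic):
        """Record an action that may invalidate information about a topic."""
        self.clock += 1
        self.last_change[topic] = self.clock

    def lookup(self, question, topics, ask):
        """Return a still-valid stored answer, or call ask(question) again."""
        if question in self.answers:
            answer, stamp = self.answers[question]
            if all(self.last_change.get(t, 0) <= stamp for t in topics):
                return answer
        self.clock += 1
        answer = ask(question)
        self.answers[question] = (answer, self.clock)
        return answer
```

An answer is reused as long as no action touching the topics it depends on has occurred since its stamp, so the same question is not repeated needlessly.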

5.  THE  USER  INTERFACE 

The lowest level of the user interface is provided
by APES, the Prolog front-end.

5.1 APES

APES,  Augmented  Prolog  for  Expert  Systems,  was 
developed  by  Hammond  and  Sergot  (Hammond,  1982) 
to  provide  a  logic-programming  environment  suited 
to the creation and development of knowledge-based
systems and other logic-programming software.
Many modifications have been made to it for this
project. 

The main features of APES useful for KBFE development
are:


sure  that  each  item  on  the  list  was  a 
valid  vector  name,  and  the  second  list  to 
ensure  that  each  item  was  a  number. 

Failure  of  any  type  check  results  in 
failure  of  the  task;  the  user  could  then 
explore  the  reason  for  failure,  using  the 
explanation  facilities  of  APES,  if  (s)he 
wished. 

(iii)  Context-free  checks  -  the  first  check  on 
semantics.  Checks  to  ensure  that  the 
task  is  feasible  in  some  context. 

For  example,  if  the  user  specified  the 
task 

keydata rows (A B C) (1 2 3 4 5 6 8 7)
checks  would  fail  since  the  number  of 
values  is  not  a  multiple  of  the  number  of 
vectors  named.  As  with  type-checking, 
failure  would  be  open  to  explanation  and 
exploration. 

(iv)  Context-sensitive  checks  -  the  final 

semantic  check.  Checks  to  ensure  that 
the  task  is  feasible  or  acceptable  in  the 
current state of the analysis. For example,
if entering data, the names of
vectors  being  defined  should  not  already 
have  been  used.  Full  explanation  of 
failure  would  again  be  available. 
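
The four levels of checking can be sketched in Python for the keydata task. This is an illustrative assumption about how such checks might be layered, not the front-end's Prolog code: the task is represented as a Python list, `check_keydata` is a hypothetical name, and the return value is simply the number of the first level that fails (0 if all pass).

```python
def check_keydata(task, defined_vectors):
    """Return 0 if all checks pass, else the number (1-4) of the failing level."""
    # (i) Pattern matching: keydata, a form keyword, then names and values.
    if (len(task) != 4 or task[0] != "keydata"
            or task[1] not in ("vector", "rows", "columns")):
        return 1
    form, names, values = task[1], task[2], task[3]
    if form == "vector":
        names = [names]  # a single vector name rather than a list
    # (ii) Basic type checking: valid names and numeric values.
    if not isinstance(names, list) or not isinstance(values, list):
        return 2
    if not all(isinstance(n, str) and n.isidentifier() for n in names):
        return 2
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in values):
        return 2
    # (iii) Context-free check: values must divide evenly among the vectors.
    if not names or len(values) % len(names) != 0:
        return 3
    # (iv) Context-sensitive check: names must not already be in use.
    if any(n in defined_vectors for n in names):
        return 4
    return 0


print(check_keydata(["keydata", "rows", ["A", "B", "C"],
                     [1, 2, 3, 4, 5, 6, 8, 7]], set()))
```

Ordering the checks from cheapest and most general to most state-dependent means that a failure message can always refer to the earliest level at which the task went wrong.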

The  actions  to  be  carried  out  for  a  particular 
task may involve sending commands to GLIM, printing
output to the screen, asserting facts to the
database, accessing files, or some combination of
these. Although the checks reduce the likelihood
of error, errors may still occur (e.g. division by zero)


(i) Declarative dialogue handling

In  a  KBFE,  some  facts  must  of  necessity  be 
obtained  from  the  user.  Therefore,  the  system 
must  query  the  user  for  the  relevant  data 
when  needed.  APES  handles  this  interaction 
with  the  user  declaratively  (Sergot,  1982). 

The  main  concept  of  query-the-user  is  that 
the  program  available  to  the  interpreter 
can  be  seen  as  a  combination  of  rules  and 
facts  within  the  computer  and  the  extra 
information  in  the  user's  mind.  If  the 
query  to  be  solved  concerns  a  relationship 
not  defined  in  the  computer,  it  is  assumed 
that the user can supply the necessary information.
The interpreter obtains this information
by printing a question and accepting
an  answer.  For  example,  if  the  query  to  be 
solved  is 

has-age  (Fred  _years) 

where  "has-age"  is  not  defined  in  the  program, 
APES evaluation results in the interaction

which (_years : has-age (Fred _years))?
Answer is 17

where  17  is  the  user's  response. 

This  approach  should  be  contrasted  with  the 
more  usual  procedural  one,  in  which  "has-age" 
might  be  defined  by  a  rule  such  as: 

has-age  (_person  _years)  if 

print  (How  old  is  _person?)  and 
read  (_years) 

Such  an  approach  is  rejected  in  APES  because 
the  rule  has  no  acceptable  logical  reading, 
i.e.  it  is  not  true  that  a  person's  age  is 
the logical conclusion of printing and reading.

(ii) Explanation facilities

The user may ask:

why  a  question  is  being  asked; 

how  a  solution  or  answer  was  reached; 

why  a  solution  was  not  reached. 

(iii)  Natural  language  templates 

The  explanations  given  by  APES  are  in  terms  of 
the  rules  used.  To  improve  their  readability 
the  programmer  may  specify  natural-language 
equivalents  of  the  rules  and  conditions  used 
in  the  explanations, 
e.g. instead of

fails-to-start (_any-car) if
    has-flat-battery (_any-car)

the explanation might state

_any-car  will  fail  to  start  if 
it  has  a  flat  battery. 

Similar  natural-language  templates  may  be  used 
to  improve  the  wording  of  questions. 

(iv) Validity constraints

In  order  to  ensure  that  the  answers  given  by 
the  user  are  sensible,  the  programmer  may 
specify  validity  constraints  on  any  answers 
given.  Thus,  if  asking  Fred's  age,  a  simple 
check would ensure that the answer is a non-negative
integer. Failure is open to explanation.

(v) Advice

The  user  may  be  asked  either  for  hard,  objective 
facts  or  for  opinions. 

In  the  latter  case,  it  might  help  the  user  if 
the  system  could  offer  its  own  suggestions, 
which  the  user  could  then  accept  or  reject. 

To  provide  such  advice  the  programmer  may 
specify  secondary,  "consultative"  relations 
which  should  be  used  to  provide  such  advice. 

This  advice  is  only  offered  if  requested,  and 
even  then  can  be  rejected,  so  the  user  remains 
firmly  in  control. 

(vi) Visual prompts

During  a  dialogue,  the  user  may  be  asked  a 
question  of  a  graphical  nature,  e.g.  "is  Y 
linear  against  X?".  To  help  him  answer  this 
question,  the  user  may  demand  to  see  a 
"visual  prompt",  e.g.  a  plot  of  Y  against  X. 
These  prompts  are  given  only  if  asked  for. 
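
Two of the features above, query-the-user and validity constraints, can be sketched together in Python. This is a hedged illustration, not the APES implementation: `make_solver`, `age_is_valid` and the `ask` callback (which stands in for printing a question and reading the reply) are names assumed here.

```python
def make_solver(facts, ask, is_valid):
    """Build solve(relation, subject), which asks the user only when needed."""
    def solve(relation, subject):
        key = (relation, subject)
        if key not in facts:
            # The relation is not defined in the program: query the user.
            answer = ask(f"which (_x : {relation} ({subject} _x))?")
            if not is_valid(relation, answer):
                raise ValueError(f"invalid answer for {relation}: {answer!r}")
            facts[key] = answer  # the answer becomes part of the program
        return facts[key]
    return solve


def age_is_valid(relation, answer):
    """Validity constraint: an age must be a non-negative integer."""
    if relation == "has-age":
        return (isinstance(answer, int) and not isinstance(answer, bool)
                and answer >= 0)
    return True


solve = make_solver({}, lambda question: 17, age_is_valid)
print(solve("has-age", "Fred"))
```

The rule base never mentions printing or reading; the dialogue is a property of the solver, which is the sense in which the interaction is handled declaratively.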

5.2 User questions at the statistical level

On top of APES' query-the-user we have allowed
for higher-level queries by the user on statistical
aspects of the front-end. These are of
three types.

(i) Definition of terms

The  user  may  not  understand  the  meaning  of  a 
question because a word, e.g. 'aliasing', is unfamiliar.
A system may need to supply such
definitions (the lexicon of Gale and Pregibon
(1982)) to help the user, particularly if (s)he
is  fairly  inexpert  in  statistics.  Clearly  a 
lexicon,  if  provided  in  great  detail,  would  amount 
to  an  on-line  statistical  text-book;  we  do  not 
plan  to  provide  such  detail,  but  to  restrict 
ourselves  to  possibly  less  familiar  words. 

(ii) Explanation of questions

More  generally  the  user  may  want  to  reply  to 
a  question  with  'why  are  you  asking  me  this 
question?',  i.e.  (s)he  may  want  some  background 
or  information  on  why  a  particular  line  is  being 
pursued.  As  with  definition,  this  could  lead  to 
the writing of very large amounts of text if
pursued in great detail. Again we plan limited
amounts of information, which will provide a background
to the strategies embodied in the rules.

(iii) Advice on strategy

The user's question is here 'what would you do?'
and in the answers is embodied the expertise of
the system. The advice is given in terms of the
primitive tasks available at the current node, including
of course the general tasks of moving to
another node etc. In the final section we outline
the advice available at the important model-selection
(MS) node of the front end.

6.  AN  EXAMPLE  OF  STRATEGY 

At  the  model-selection  (MS)  node  the  user  can 
get  the  following  background  information  on  the 
organization of the node:

Background on MS organization

Activity  MS  helps  develop  some  parsimonious 
models  for  the  data.  It  first  establishes 
basic information about the GLM:

    the response variate
    the set of possible explanatory variables
    the link function
    the variance function
    prior weights, if any
    offset, if any.                               (1)

MS then searches for suitable sets of terms to include
in a linear predictor. During the search
a  tree  of  possible  models  is  constructed  and  for 
each  node  of  the  tree  the  current  set  of  possible 
terms  in  the  linear  predictor  is  divided  into 
three  categories 

(i) the kernel - terms currently thought
of  as  necessary 

(ii)  free  terms  -  terms  whose  status  is 
currently  doubtful 

(iii)  rejected  terms  -  terms  removed  from 
further  consideration. 

Each  node  has  a  number  and  is  associated  with  two 
basic nodes holding information given in (1)
above. 

There  is  a  set  of  tasks  available,  which  are 
useful  steps  in  model  selection  procedures  and 
are  used  by  the  system's  own  strategy.  They  are 
also  available  to  the  user  for  his/her  own 
purposes. 

The  strategy  used  by  the  front-end  is  currently 
described  as  follows: 

Background  on  MS  strategy 

The  strategy  has  two  main  stages.  Stage  1 
looks  for  sets  of  primary  terms  that  give 
parsimonious  models.  A  primary  term  is  a 
factor  or  a  variate  chosen  from  the  initial  set 
of  possible  explanatory  variables.  Stage  2 
looks  for  additional  compound  terms  that  may 
improve the fit. These include squared (x**2)
and cross-terms (x1 x2) of variates, interaction
terms of factors (A.B), or mixed terms like A.x,
where  the  slope  for  x  varies  with  the  level  of  a 
factor  A.  In  more  detail: 

Stage  1 

A series of nodes is created, each representing
a tentative model as a kernel K of currently
accepted terms, and a set of free terms F whose
status is currently doubtful. By implication
there is also a set R of terms originally free
but  now  rejected.  At  each  cycle  the  current  set 
of  free  terms  F  is  split  into  3  subsets 
FK : those transferred from F to K

FF  :  those  remaining  in  F 

FR  :  those  transferred  from  F  to  R. 

At  cycle  0  the  kernel  K  contains  necessary 
terms,  specified  by  the  user  as  being  essential, 
and  F  contains  the  rest  of  the  initial  set  of 
explanatory  variables.  The  three  subsets 
FK,FF,FR  are  obtained  as  follows:  each  free  term 
is  tested  by  forming  two  F-values;  the  forward 
F-value  is  obtained  by  adding  it  singly  to  the 
kernel  and  the  backward  F-value  by  removing 
it  from  the  maximal  model  K+F  which  includes  all 
the  free  terms.  The  denominator  of  the  F- 
statistic  is  either  a  prior  value  of  the  baseline 
mean  deviance  or  is  obtained  from  the  fit  of  the 
maximal  model.  Two  critical  values  for  forward 
and  backward  F-values  are  defined,  and  an  F- 
value  exceeding  its  critical  value  is  called 
positive,  else  negative.  Any  term  yields  one 
of  four  possible  results  and  these  are  allocated 
as  follows: 

forward F-value   backward F-value   allocation
       +                 +              FK
       +                 -              FF
       -                 +              FF
       -                 -              FR

Default  settings  for  the  critical  values  are 
both  2.  This  cycling  process  continues  until 
either  (i)  the  set  F  becomes  empty  or  (ii)  the 
set  F  is  non-empty  but  unchanging.  If  (i)  occurs 
the  stage-1  model  selected  is  unique;  if  (ii) 
occurs  each  remaining  free  term  is  transferred  to 
the  kernel  and  the  cycling  repeated.  The  result 
is  a  tree  of  possible  stage-1  models. 
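
The allocation rule and a single stage-1 cycle can be sketched as follows. This is a minimal illustration, not the front-end's own code: the function names and the `f_values` callback (which would fit the models and return the forward and backward F-values for a term) are assumptions introduced here.

```python
def allocate(forward_f, backward_f, forward_crit=2.0, backward_crit=2.0):
    """Return 'FK', 'FF' or 'FR' for one free term.

    An F-value is 'positive' when it exceeds its critical value;
    both critical values default to 2, as stated in the text.
    """
    forward_pos = forward_f > forward_crit
    backward_pos = backward_f > backward_crit
    if forward_pos and backward_pos:
        return "FK"          # + + : move to the kernel
    if not forward_pos and not backward_pos:
        return "FR"          # - - : reject
    return "FF"              # mixed result: term stays free


def stage1_cycle(kernel, free, f_values):
    """Run one cycle: split the free terms into FK, FF and FR.

    f_values(term, kernel, free) is a hypothetical callback returning
    (forward_f, backward_f) for the given term.
    """
    subsets = {"FK": [], "FF": [], "FR": []}
    for term in free:
        forward_f, backward_f = f_values(term, kernel, free)
        subsets[allocate(forward_f, backward_f)].append(term)
    return kernel + subsets["FK"], subsets["FF"], subsets["FR"]
```

Repeating `stage1_cycle` until the free set is empty or unchanged gives the stopping conditions (i) and (ii) described above.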

Stage  2 

For a selected stage-1 model with initial kernel
K1 and free set F1 and final kernel K2 and free
set F2 (which may be null), second-order terms
(cross-terms) are generated as follows:

Let K'=K1+F1, i.e. all primary terms originally
considered. Then generate all compound terms of
the form K'x(K2+F2), and assign (subject to
marginality constraints) these terms to a set
FC. Using K' as a working kernel we find forward
F-values for elements of FC. Often there
will be too many terms to obtain backward F-values,
so we get a working set of free terms by selecting
the positive terms from successive forward
selections. This working set is then used with
a stage-1 procedure to make a selection of compound
terms. Finally all simple terms not
occurring in any of the accepted compound terms
are re-checked for inclusion by a stage-1
procedure.
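
The generation of the candidate set FC can be sketched as below. This is only an illustration under stated assumptions: the function name, the `kinds` mapping (term name to "variate" or "factor") and the rule that a factor is never crossed with itself are introduced here, and marginality constraints are not modelled.

```python
def compound_terms(all_primary, retained, kinds):
    """Generate candidate compound terms of the form K' x (K2+F2).

    all_primary : primary terms originally considered (K' = K1 + F1)
    retained    : terms in the final kernel and free set (K2 + F2)
    kinds       : dict mapping each term to "variate" or "factor"
    """
    seen = set()
    terms = []
    for a in all_primary:
        for b in retained:
            if a == b and kinds[a] == "factor":
                continue                 # A.A adds nothing for a factor
            key = frozenset((a, b))
            if key in seen:
                continue                 # a.b and b.a are the same term
            seen.add(key)
            # squared variate, or a cross / interaction / mixed term
            terms.append(f"{a}**2" if a == b else f"{a}.{b}")
    return terms
```

For example, with primary terms A (factor), x1 and x2 (variates) and retained terms A and x1, the candidates are A.x1, x1**2, x2.A and x2.x1.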

The  advice  given  by  the  system  is  in  terms  of 
the  tasks  defined  for  the  node,  and  background 
information  about  these  is  also  available  to  the 
user. 

This  strategy  will  undergo  further  development 
and  refinement  as  the  project  proceeds.  Data 
sets  suitable  for  modelling  by  a  wide  variety  of 
GLMs  are  being  accumulated,  and  will  be  used 
to  test  the  strategy  both  here  and  at  other 
nodes.


ARTIFICIAL INTELLIGENCE TECHNIQUES FOR RETROSPECTIVE HELP IN DATA ANALYSIS
William  H.  Nugent,  Harvard  University 


With  the  advent  of  personal  computers 
and  workstations  with  the  computing  power 
and storage capacities of the mainframes
of 15 years ago, it is a simple task to
run multiple analyses interactively on a
single dataset. We can now do in hours
what used to take days or weeks in the
batch environment of 15 years ago. But
the ability to explore a data set so
easily in a relatively short period of time
begins to strain our capacity to keep
the data analysis organized. The record
of  commands,  or  script,  becomes  so 
lengthy  and  complicated  by  the  various 
branchings  and  dead  ends  in  the  process 
of  exploratory  data  analysis,  that  it 
becomes  difficult  to  determine  the 
origins  and  interdependencies  of  the 
objects  and  data  structures  in  the 
computer  workspace.  To  help  the  analyst 
understand  the  evolution  of  his  data 
analysis,  statistical  software  must 
provide  the  tools  for  this  meta-analysis 
problem.  This  paper  presents  a  tool 
which  has  been  developed  to  address  the 
meta  problem  of  script  analysis:  the 
determining  of  the  definitions  and 
interdependencies  of  commands  and 
variables.  This  is  a  natural  area  to 
automate  for  three  reasons: 

1)  Searching  through  a  script  to  find  a 
variable  or  command  reference  is  a 
tedious  process. 

2)  An  analyst  makes  mistakes  when 
searching  manually  through  the  script. 

3)  We  have  proven  A. I.  technology  which 
can  be  applied  to  this  problem. 

This  problem  can  be  partially  solved 
with the use of a search command in a
text  editor.  But  this  is  not  a  very 
efficient  solution.  For  example,  in 
searching  for  the  definition  of  a 
variable,  all  occurrences  must  be 
examined,  even  if  only  a  definition  is 
sought.  The  analyst  is  forced  to  perform 
the  following  subtask  before  the  variable 
definition  can  be  found  (Fig.  1). 

SEARCH FOR THE FIRST OCCURRENCE OF THE VARIABLE
WHILE (NOT A DEFINITION)
    SEARCH FOR NEXT OCCURRENCE OF THE VARIABLE

Figure  1. 

The user is doing the filtering! Why not
off-load this task to the computer?
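
Once each occurrence carries a tag saying how the variable is used, the filtering can be done by the machine. The sketch below is written in Python for illustration (SAT itself is written in LISP); the `Occurrence` record and the function name are assumptions, and the occurrence list is assumed to be sorted by line number.

```python
from typing import NamedTuple, Optional, Sequence


class Occurrence(NamedTuple):
    line: int   # line number in the script
    name: str   # variable name
    kind: str   # "definition", "reference", "modification" or "deletion"


def previous_definition(occurrences: Sequence[Occurrence],
                        name: str,
                        current_line: int) -> Optional[Occurrence]:
    """Return the last definition of `name` before `current_line`, if any."""
    for occ in reversed(occurrences):
        if (occ.line < current_line
                and occ.name == name
                and occ.kind == "definition"):
            return occ
    return None
```

The loop skips references, modifications and deletions, which is exactly the subtask of Figure 1 that a plain text-editor search leaves to the user.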

To  automate  this  subtask  requires  a 
program  which  has  a  syntactical  knowledge 
of  the  computer  language  being  searched 
by  the  analyst.  That  is,  the  program 
must  be  able  to  determine  the  difference 
between  commands,  functions,  variables, 
and  user  defined  command  procedures. 
Further, such a program must be able to
distinguish between the different types
of uses of variables and procedures:
their definitions, references,
modifications, and deletions. I have
developed a program which provides
precisely this kind of automation: the
SAT program. SAT is a set of Script
Analysis Tools.

SAT  is  a  general  purpose  program  which 
can  be  easily  modified  so  that  almost  any 
computer  language  can  be  analyzed.  SAT 
is currently working with ISP (1), but can be
adapted to a different language by
changing the parser. SAT is similar in
purpose to the Programmer's Apprentice
developed at MIT's AI Lab: by providing
a means to analyze a program, SAT allows
the user to examine his/her data analysis
from a more abstract level. SAT has been
developed  on  an  IBM  PC  using  Gold  Hill 
Computers'  Golden  Common  LISP. 

SAT's  parser  generates  a  database 
which  is  then  referenced  by  simple 
functions  that  look  for  definitions, 
references,  or  all  occurrences  of 
variables  and  commands.  The  relationship 
between  different  variables  can  also  be 
examined,  resulting  in  forward  and 
reverse  dependency  chains.  The  ability 
to  generate  dependency  chains  between 
commands  is  a  very  powerful  tool  for 
examining  what  the  analyst  has  done  in 
the  data  analysis,  by  showing  how  one 
command  uses  the  results  of  an  earlier 
command.  What  I  have  described  so  far  is 
an  interactive  cross-reference  program 
which  can  highlight  items  of  interest. 

It  is  more  efficient  at  finding  specific 
occurrences  in  script  than  either  a 
manual  search  or  a  search  with  a  text 
editor.
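
The forward and reverse dependency chains mentioned above amount to walking a graph built from each command's inputs and outputs. A minimal Python sketch, assuming each parsed command is reduced to a pair (inputs, outputs); the function names and this representation are illustrative and not SAT's actual database layout:

```python
from collections import defaultdict, deque


def build_graph(commands):
    """Map each output variable to the set of variables it is computed from."""
    depends_on = defaultdict(set)
    for inputs, outputs in commands:
        for out in outputs:
            depends_on[out].update(inputs)
    return depends_on


def invert(graph):
    """Reverse the edges: map each variable to the variables computed from it."""
    inverse = defaultdict(set)
    for var, deps in graph.items():
        for dep in deps:
            inverse[dep].add(var)
    return inverse


def chain(graph, start):
    """Return every variable reachable from `start` by following the edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

With `g = build_graph(commands)`, `chain(g, v)` lists what variable `v` depends on, and `chain(invert(g), v)` lists what depends on `v`.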

Because  of  the  limitations  on  memory 
space,  SAT  is  currently  only  familiar 
with  the  syntax  of  the  statistical 
package  ISP.  SAT’s  ability  to  understand 
a  script  currently  includes  only  a 
general  description  of  ISP  commands  and 
functions; however, even with this
restriction it can perform useful
meta-analyses.

ISP commands and functions have order-dependent
input and output arguments.
For example, the linear regression
command REGRESS can have two input
variables. The first is the independent
variable, the second the dependent. SAT
does not distinguish between the two
inputs; both are treated the same,
as inputs. This is also true of results
returned by commands: SAT only notes they
are outputs. SAT's knowledge of command
parameters also suffers from similar
shortcomings.  Global  parameters  are 
recorded,  but  SAT  does  not  know  which 
parameters  are  used  by  the  various 
commands.  To  understand  why  these 
shortcomings  are  only  superficial  and  not 
a  major  design  flaw,  a  deeper 
understanding  of  the  internal  structure 
of  SAT  is  required. 




When  SAT  builds  the  database  from  the 
user  script,  cross-reference  lists  of  all 
occurrences  of  commands,  functions, 
variables,  and  procedures  are  generated. 
When  a  variable  is  being  entered  into  the 
database,  it  is  noted  how  the  variable  is 
being  used:  as  a  definition,  reference, 
modification,  or  deletion.  This  task  is 
made  easier  by  the  general  uniformity  of 
most  ISP  commands. 


isp-command operands > results /parameters(=value)
let  result  =  operand 


Figure  2. 


In  the  first  form  in  Figure  2,  the 
first  word  is  the  command  name,  the  next 
group  of  words  up  to  the  greater  than 
sign  are  the  inputs,  and  any  words  after 
the  greater  than  sign  are  outputs.  In 
the  second  form,  to  the  left  of  the  equal 
sign  is  the  output  and  everything  after 
the  equal  sign  is  the  input.  Parameters 
can  occur  anywhere  on  the  line  and  are 
preceded by a slash. Overall, it is
very  easy  to  tag  the  different 
occurrences  of  variables  because  of  the 
position  dependencies  of  ISP  commands. 
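
The positional tagging described above can be sketched in a few lines. This Python illustration assumes whitespace-separated tokens, with `>` and `=` as separate tokens; the function name, the returned dictionary and the sample command lines in the usage note are hypothetical and are not taken from ISP documentation.

```python
def parse_line(line):
    """Tag the words of one command line as inputs, outputs or parameters."""
    tokens = line.split()
    command = tokens[0].lower()
    # Parameters may occur anywhere and are preceded by a slash.
    params = [t[1:] for t in tokens[1:] if t.startswith("/")]
    words = [t for t in tokens[1:] if not t.startswith("/")]

    if command == "let":
        # Second form: let result = operand
        eq = words.index("=")
        outputs, inputs = words[:eq], words[eq + 1:]
    elif ">" in words:
        # First form: inputs before the greater-than sign, outputs after it
        gt = words.index(">")
        inputs, outputs = words[:gt], words[gt + 1:]
    else:
        inputs, outputs = words, []
    return {"command": command, "inputs": inputs,
            "outputs": outputs, "params": params}
```

For instance, `parse_line("regress x y > coef /print=yes")` tags `x` and `y` as inputs and `coef` as an output, without recording which input is the dependent variable. That is the limitation discussed in the text.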

At  this  time,  SAT  does  not  incorporate 
knowledge  about  the  positional 
dependencies  of  ISP  operands  and  results. 
This  would  require  a  special  handler  for 
each  command  to  label  further  each 
variable.  Currently,  only  the  commands 
which  are  the  exceptions  to  the 
generalized  ISP  command  syntax  have  their 
own  handler;  about  a  dozen  of  the  seventy 
plus  commands  diverge  from  these  two 
general  command  layouts. 

WHY  A  NATURAL  LANGUAGE  INTERFACE? 

During  February  1986,  a  natural 
language  interface  was  added  to  the 
interactive  cross-reference  program.  A 
natural  language  interface  was  chosen 
instead  of  a  menu  driven  system  or 
command  language  for  the  following 
reasons:

1) Both menu-driven systems and command
language interfaces are well known.
Both systems have their own advantages
and disadvantages. The majority of
menu systems are easy for the new
user, but become burdensome for the
advanced user. The tree structure of
most menus causes the advanced user
much trouble when having to repeatedly
traverse up and down the branches.
One area which should be explored for
menu-driven systems, which seems to be
overlooked, is the ability to easily
jump between branches, similar to how
INFO, the EMACS help facility, works.


2) The drawback of the traditional command
language interface is that the commands
become too complicated to use,
possibly even for experienced users.

3)  The  natural  language  interface  was 
chosen  because  it  seemed  possible  to 
design  a  natural  language  interface 
that  would  be  easy  to  code,  and  yet 
still  have  the  ability  to  understand 
complex commands. Another important
advantage is that the system would be
able  to  give  a  higher  degree  of 
feedback;  if  a  command  is  not 
understood,  the  system  can  ask  the 
user  a  question.  When  a  command  is 
understood,  a  generic  statement  could 
be  echoed  to  the  user  telling  what  the 
system  is  doing.  With  feedback  of 
this  nature,  the  user  could  learn  what 
the  limitations  of  the  system  are,  and 
how  to  get  around  them.  More 
experienced  users  could  even  teach  the 
system  new  phrases  and  words. 

The  natural  language  interface 
developed  for  SAT  was  inspired  by  the 
computer  program  ELIZA  written  by  Joseph 
Weizenbaum  of  MIT  (2).  This  type  of 
natural language interface, or engine,
has one outstanding feature: its
simplicity. This kind of engine has only
a superficial knowledge of the English
language. In brief, the way ELIZA works
is to process each line of input from the
user by searching it for the presence of
a keyword. Associated with each keyword
is a list of transformation rules. The
appropriate transformation rule is
applied to the input, and the program
answers back with a question based
on  the  input.  SAT,  rather  than  asking  a 
question,  performs  an  action.  The  input 
is  transformed  into  a  command  which  is 
then  executed  by  the  interactive 
cross-reference  subsystem. 

For  example,  if  a  user  types  in  "WHERE 
WAS  VARIABLE  PRICES  DEFINED  LAST",  SAT 
would  first  print  out  a  generic  statement 
of  what  it  understood  the  user  request  to 
be.  Then  SAT  would  search  backwards  from 
the  current  position  looking  for  the  last 
definition  of  PRICES.  If  the  definition 
is  found,  it  is  displayed  on  the  screen 
in  reverse  video  along  with  any  comments 
on  the  same  line. 


SAT>  where  was  variable  prices  defined  last 
Find  the  previous  definition  of  variable  prices. 


The  keyword  in  the  above  example  is 
the  word  "variable".  The  associated 
action  rule  requires  the  words  "where" 
"defined"  and  "last"  to  appear  in  the 
approximate  locations  shown  above.  The 
action  rule  allows  for  synonyms.  The 
word  "where"  could  be  replaced  with 
"find",  "locate",  or  "show".  The  same  is 
true  for  the  other  two  words,  "defined" 
and  "last".  Symbolically,  the 
transformation  rule  is  shown  in  Figure  3. 



Figure  3. 


The  action  rule  in  reality  is  more 
complicated  than  what  is  shown.  It  is 
able  to  represent  many  types  of  sentences 
in  a  single  rule  by  allowing  some  words 
to  be  translated  to  symbols  which  are 
passed  to  the  underlying  LISP  function. 
The  above  example  has  been  simplified  for 
the  sake  of  understanding  and  brevity. 
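
A much-simplified version of such a keyword rule can be sketched as follows. The Python below is illustrative only (SAT's rules are LISP structures): the rule table, the action name and the wording of the fallback question for a missing variable name are assumptions. The synonyms for "where" are those listed above; the paper does not list the synonyms for "defined" and "last", so none are invented here.

```python
SYNONYMS = {
    "where": {"where", "find", "locate", "show"},
    "defined": {"defined"},   # synonyms exist in SAT but are not listed
    "last": {"last"},         # synonyms exist in SAT but are not listed
}


def interpret(text):
    """Turn one line of user input into an action or a follow-up question."""
    words = text.lower().rstrip("?.").split()
    if "variable" not in words:
        return None                        # no keyword: rule does not apply
    k = words.index("variable")
    if k + 1 >= len(words):
        return ("ask", "WHICH VARIABLE?")  # hypothetical wording
    name = words[k + 1]
    before, after = words[:k], words[k + 2:]
    matched = (
        any(w in SYNONYMS["where"] for w in before)
        and any(w in SYNONYMS["defined"] for w in after)
        and any(w in SYNONYMS["last"] for w in after)
    )
    if matched:
        return ("action", "find-previous-definition", name)
    return ("ask", f"WHAT ABOUT VARIABLE {name.upper()}?")
```

The rule checks only that the required words appear on the expected side of the keyword, which mirrors the "approximate locations" requirement without any real grammar.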

If  not  enough  information  is  typed  in 
by  the  user,  SAT  will  respond  with  a 
question.  For  example: 

SAT>  where  was  variable  xmax  (  user  input  ) 
WHAT  ABOUT  VARIABLE  XMAX?  (SAT  response) 


At  this  time,  the  help  system  is  not 
as  good  as  one  would  like.  It  should 
help  the  user  by  listing  some  possible 
keywords  that  would  make  the  user’s  input 
into  a  valid  action  statement. 

The  types  of  questions  SAT  interprets 
are  about  the  location  of  commands, 
functions,  and  variables.  SAT  is  also 
able  to  build  both  forward  and  reverse 
dependency  chains  on  either  variables  or 
command  lines.  A  dependency  chain  can 
best  be  thought  of  as  a  question  such  as 
"WHAT  DEPENDS  ON  VARIABLE  PRICES?"  or 
"WHAT  DOES  VARIABLE  PRICES  DEPEND  ON?". 

In  searching  for  the  occurrence  of  a 
variable,  SAT  is  able  to  distinguish 
between  definitions,  references,  and 
deletions.  The  user  is  able  to  request 
searches  in  either  the  forward  or  reverse 
direction  with  only  the  next  or  all 
occurrences  being  sought. 

At  the  present,  SAT  is  not  able  to 
interpret  complex  user  commands.  A  user 
input  of  "SHOW  ALL  REGRESSION  COMMANDS 
USING  VARIABLE  X  AS  THE  DEPENDENT 
VARIABLE"  is  too  complicated  and,  as  was 
mentioned  earlier,  SAT  does  not  currently 
have  a  parser  sophisticated  enough  to 
differentiate  between  input  and  output 
arguments.

FUTURE  DIRECTIONS  FOR  SAT: 

The power of SAT could be greatly
enhanced by integrating it with a full
parser for the underlying computer
language, such as S, C, or some other
computer or data analysis language. With
a full parser, SAT would be able to
differentiate between the modification of
an array cell and the definition of the
array. SAT would also be able to
differentiate between order-dependent
arguments to commands and functions.

The next area in which SAT can be
enhanced is support for complex and
compound natural language commands. The
user  needs  to  be  able  to  specify  a 
compound  natural  language  command,  such 
as 


"SHOW  THE  GSCAT  FUNCTION  WHICH  USES 
VARIABLES  DATES  AND  PRICES" 


The ability to process 'and' and 'or'
clauses  is  necessary  for  further 
flexibility  and  functionality. 

SAT  also  needs  to  be  able  to 
understand  user  macros  or  command 
procedures.  When  a  macro  is  read  in,  SAT 
should  treat  that  segment  of  code  in  a 
special  manner.  When  the  macro  is  later 
invoked  in  the  script,  some  guesses  can 
be  made  as  to  which  variables  will  be 
modified  as  a  result  of  the  macro 
execution.  Since  a  macro  may  have 
commands  which  are  conditionally 
executed,  some  variables  may  not  be 
modified when a particular macro is
invoked.  It  is  possible  that  there  is 
information  outside  the  scope  of  the 
static  code  image  that  SAT  analyzes. 
Therefore,  SAT  has  to  make  a  guess  about 
how  these  code  segments  affect  variables 
in the code segments. SAT needs to be
able to flag these variables as lying in
a gray area, separating what it can only
guess from what are known facts
concerning macro execution.

Another way in which SAT could help the
user is by providing special graphics for
building a graph of the analysis tree, as
performed by DINDE (3). The ability to
group sequences of commands together and
display them as a node in a graph
helps the analyst to abstract his/her
work  and  to  think  about  it  at  a  higher 
level.  By  providing  a  graph  with  the 
results  of  analysis,  reviewers  can  easily 
see  what  was  done  without  having  to  read 
the  particular  statistical  language. 

Finally, SAT needs to be included in
the  design  of  a  statistical  computer 
language.  By  incorporating  SAT  into  the 
design,  SAT  would  gain  clear  knowledge 
about  what  occurs  inside  macros,  what  the 
different  array  dimensions  are  and  how 
these  affect  commands.  Further,  by 
integrating  SAT  into  a  statistics  package 
along  with  an  editor,  SAT  could  provide 
feedback  on  script  changes  as  the  user 
makes  them. 

SAT  is  a  first  step  in  providing  a 
comprehensive  set  of  meta-tools  for  a 
statistical  computer  language  to  help  the 
analyst  document  and  understand  what  has 
been done. When fully integrated, the
power of the tools will increase because
the information about a workspace
environment can be directly accessed by
SAT, rather than guessed at. The full
extension of SAT would share many ideas
with DINDE, but with a major difference:
SAT would have an underlying command
language.

Some  of  the  options  such  an 
environment  would  provide  are: 

- A script cleaning tool, similar to Lint,
that provides the user with diagnostics
about his/her script, such as which
commands are only informative,
unnecessary, and the like. This would help
the analyst to streamline the code
and to find potential trouble
spots.

- A Macro Learner, a tool that searches
through multiple scripts looking for
common command sequences which could
be generalized as a new macro. A
prototype has already been written by
Russell Almond, a graduate student at
the Department of Statistics at
Harvard University.

- A Storage Manager which stores on disk
the various scripts, session records,
workspaces, and graphs so they can be
retrieved and modified with full
context.

- A Perspective Help Daemon, which
monitors the data analysis session
progress and suggests potentially
useful macros. The Perspective Help
Daemon would work by comparing the
user input to the macro library, and
if a close match was found, it would
suggest use of the macro.

Creation of the SAT program marks the
beginning of a data analysis
environment where many tedious
housekeeping chores are assumed by the
computer. Also, the computer can
compare what the analyst is doing to a
known database of previous sessions
and libraries of macros to suggest
alternative methods.

Right now, we are enjoying the
hardware and software of small powerful
machines. Analyzing large databases and,
more importantly, doing multiple analyses
quickly is something we can all do.
Dreams of 15 years ago are our reality.
Along with this reality, we have
discovered the meta problem of
maintaining coherency in our analysis
paths. We can do so much so quickly that we
must now pay attention to organizing and
knowing our voluminous output.

Developing software such as SAT
is going to be an exciting area of
research, both because of the tough technical
problems involved and because it will
provide easy access for
the unsophisticated user to these new
high-power tools.


Taylor approximations for the variances of the
approximate distributions of statistics computed
from complex surveys are outlined. A program
implementing variance estimation on the IBM-PC
for use with large scale surveys is described.
The program will compute estimators and estimated
variances for totals, ratios, subpopulation means,
and regression coefficients.


I. Introduction


Most large scale surveys of human populations
are of relatively complex design. Typically the
population is subdivided into subgroups, called
strata, and independent samples are selected from
each stratum. Sampling rates are often different
in different strata. Also it is common to select
individuals in clusters. Examples of such clusters
include all persons living in a geographic
area such as a village and all persons in a
particular  housing  unit.  Stratification  and 
clustering  do  not  exhaust  the  complexities 
present  in  most  surveys,  but  they  are  sufficient 
to  explain  why  most  samples  cannot  be  treated  as 
simple  random  samples  of  individuals. 

The  description  of  stratified  cluster  samples 
also  establishes  the  three  main  components  that 
determine  the  way  an  observation  is  treated  in  an 
analysis of survey data. These are (1) the
stratum to which the individual belongs, (2) the
primary  sampling  unit  (cluster)  to  which  the 
individual  belongs,  and  (3)  the  weight  (equal  to 
the  inverse  of  the  selection  probability)  for  the 
individual.  The  data  record  for  an  individual 
used  in  a  statistical  analysis  must  contain  these 
three  components. 

The  second  dimension  of  survey  analysis  that 
requires  special  consideration  is  the  volume  of 
estimates  produced.  The  basic  output  of  most 
surveys  is  a  large  number  of  tables,  most  of 
which  are  two-way  tables.  Given  the  typical 
survey  design,  each  entry  in  the  table  is  a 
rather  complex  function  of  the  observations. 
Consider  an  estimate  of  the  fraction  of  females 
26  through  30  years  of  age  that  are  employed. 

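For a stratified cluster sample, this estimate is
defined by

$$\hat{\theta} = \frac{\sum_{i=1}^{L} \sum_{j=1}^{n(i)} \sum_{k=1}^{m(ij)} w(ijk)\, y(ijk)}{\sum_{i=1}^{L} \sum_{j=1}^{n(i)} \sum_{k=1}^{m(ij)} w(ijk)\, x(ijk)},$$

where

y(ijk) = 1 if the individual is an employed
female 26 through 30 years of age,
0 otherwise,

x(ijk) = 1 if the individual is a female 26
through 30 years of age,
0 otherwise,
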
w(ijk)  is  the  weight  for  the  k-th  individual  in 
the  j-th  cluster  of  the  i-th  stratum,  m(ij)  is 
the  number  of  individuals  in  the  j-th  cluster  of 
the i-th stratum, n(i) is the number of clusters
in the i-th stratum, and L is the number
of strata. The clusters are the primary sampling
units and the estimator of θ is the ratio of
sample means of cluster totals. As such, it is a
nonlinear function of the cluster means. It
follows that a method appropriate for nonlinear
functions must be used to estimate the variance
of the approximate distribution of the estimator.
See Wolter (1985) for a discussion of
variance estimation for complex surveys. The
Taylor method (method of statistical differentials)
is used in PC CARP. For the ratio
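estimator, the variance is estimated by

$$\hat{V}(\hat{\theta}) = \hat{X}^{-2} \sum_{i=1}^{L} [1 - f(i)]\, \frac{n(i)}{n(i) - 1} \sum_{j=1}^{n(i)} [d(ij) - \bar{d}(i)]^{2},$$

where $\hat{X}$ is the denominator of the ratio estimate,
$d(ij) = \sum_{k} w(ijk)\,[y(ijk) - \hat{\theta}\, x(ijk)]$ is the
linearized total for the j-th cluster of the i-th stratum, with
y(ijk) and x(ijk) the numerator and denominator indicators
defined above, $\bar{d}(i)$ is the mean of the d(ij) in the
i-th stratum, f(i) = n(i)/N(i) is the stratum sampling rate,
and N(i) is the population number of clusters in
the i-th stratum. The variance of the ratio
estimator is given in such standard texts as that
of Cochran (1977).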
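
As an illustration of the Taylor (linearization) approach for the ratio estimator, the following Python sketch computes the estimate and a variance estimate from records carrying stratum, cluster and weight. It is not PC CARP's FORTRAN code: the function name and record layout are assumptions, and the variance uses the standard linearized form with a finite population correction, which is assumed here rather than taken from the program.

```python
from collections import defaultdict


def ratio_and_variance(records, sampling_rate=None):
    """Ratio estimate and its Taylor-linearized variance estimate.

    records       : iterable of (stratum, cluster, weight, y, x)
    sampling_rate : optional dict stratum -> f(i); if omitted, f(i) = 0
    """
    y_tot = defaultdict(float)   # (stratum, cluster) -> weighted total of y
    x_tot = defaultdict(float)   # (stratum, cluster) -> weighted total of x
    for stratum, cluster, w, y, x in records:
        y_tot[(stratum, cluster)] += w * y
        x_tot[(stratum, cluster)] += w * x

    x_hat = sum(x_tot.values())
    theta = sum(y_tot.values()) / x_hat

    # Linearized cluster totals d(ij), grouped by stratum
    d_by_stratum = defaultdict(list)
    for key in y_tot:
        d_by_stratum[key[0]].append(y_tot[key] - theta * x_tot[key])

    variance = 0.0
    for stratum, d in d_by_stratum.items():
        n = len(d)
        if n < 2:
            raise ValueError(f"stratum {stratum} has fewer than two clusters")
        f = 0.0 if sampling_rate is None else sampling_rate[stratum]
        mean = sum(d) / n
        ss = sum((v - mean) ** 2 for v in d)
        variance += (1.0 - f) * n / (n - 1) * ss
    return theta, variance / x_hat ** 2
```

The error raised for a stratum with a single cluster reflects the fact that a between-cluster variance cannot be computed there.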

The  project  to  develop  statistical  software 
for  complex  surveys  is  a  joint  undertaking 
between  Iowa  State  University  and  the  Inter¬ 
national  Statistical  Programs  Center  of  the  U.S. 
Census  Bureau.  The  objective  is  to  provide 
developing  countries  with  software  that  can  be 
used locally to process survey data collected
locally. The Iowa State University project on
variance  estimation  is  a  part  of  a  larger  project 
that  includes  the  development  of  software  for 
survey  management,  data  editing  and  tabulation. 

Beginning  in  the  early  1970's,  based  on  the 
work  of  Hidiroglou  (1974)  and  Fuller  (1975),  a 
program  was  developed  at  Iowa  State  University 
for  the  computation  of  regression  coefficients 
and  the  estimated  covariance  matrix  of  the 
coefficients  for  survey  data.  The  program, 
called  SUPER  CARP,  was  later  expanded  to  include 
total  estimation,  ratio  estimation,  subpopulation 
statistics,  two-way  tables  and  two  stage 
samples.  The  last  revision  of  SUPER  CARP  took 
place  in  1980.  That  program  furnished  the 
starting  point  for  the  development  of  PC  CARP. 

The  IBM  Personal  Computer  XT  was  chosen  by  the 
Census  Bureau  as  the  equipment  for  which  the 
software  was  to  be  designed.  The  personal 
computer  seems  an  ideal  machine  for  developing 
countries for several reasons. First, it is
relatively tolerant of its environment, both
physical and personal. When compared to
mainframe computers, the personal computer can
accept greater variation in temperature,
humidity, and electric current. The personal
computer also has much lower requirements for
trained operators and maintenance personnel. See
Diskin (1985) for a description of the problems
developing countries face in maintaining trained
staff. The personal computer is under the direct
control of the user and if a personal computer is
placed in a survey unit, access to the computer
becomes relatively easy. (Nothing is ever
guaranteed in a bureaucracy.) Finally, the
personal computer is inexpensive. A superior
configuration for the program under development
costs about $6,000.

The  IBM  Personal  Computer  XT  is  equipped  with 
a  hard  disk  drive,  one  floppy  disk  drive  and  a 
monitor. In addition, PC CARP requires that the
machine have 256K bytes of memory and a math
coprocessor. The program is written almost
entirely in FORTRAN. The FORTRAN language was
chosen because it is the most widely known
scientific programming language; hence, if
necessary, the program can be easily modified to
suit particular needs of the user. The IBM
Professional FORTRAN compiler was selected for
the project. A small portion of the code - some
sections of the user interface - is written in
IBM Assembly language. The program runs under
the DOS operating system, Version 3.0.

II.  Program  Capability 

PC CARP is capable of handling both large and
small data sets with equal ease and efficiency.
It is most desirable to store large data sets on
the hard disk because of its large capacity and
the speed at which data can be transferred. If
the hard disk is not available, large data sets
may be stored on a series of floppy diskettes. A
single floppy diskette is usually sufficient for
a small data set. The program is also capable of
accepting data entered from the keyboard during
execution. The program sets no limit on the
number of strata or clusters that can appear in a
data set and a data set may have up to 50 input
variables. The program accepts disk data files
in either fixed or internal (binary) format.
Along with the data, the user has the option of
providing stratum sampling rates (f(i)). These
rates are kept on a diskfile separate from the
data and are used in variance computations.

The program can be used to compute variances
for one or two stage samples. An example of
variance estimation for the ratio estimator in a
one stage sample is given above. The relevant
variance is the within-strata, between-cluster
variance component. In a two-stage sample, a second
component, the within-cluster variance component,
also enters the variance expression. The program
computes within-cluster sampling rates (f(ij))
from the stratum sampling rates and the individual
record weights. The within-cluster sampling
rate is f(ij) = 1/(w(ijk) f(i)). These are
sequentially written into a diskfile. The second
component is added to the variance estimators if
the user selects the two-stage option.
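
The within-cluster rate follows from the weight being the inverse of the overall selection probability, so that w(ijk) = 1/(f(i) f(ij)). A one-function Python sketch (the function name and the range check are assumptions added for illustration):

```python
def within_cluster_rate(weight, stratum_rate):
    """Return f(ij) = 1 / (w(ijk) * f(i)) for one record."""
    rate = 1.0 / (weight * stratum_rate)
    if not 0.0 < rate <= 1.0 + 1e-9:
        # A rate above 1 means the weight and stratum rate are inconsistent.
        raise ValueError("weight and stratum rate give a rate outside (0, 1]")
    return rate
```

For example, a record weight of 20 in a stratum sampled at rate 0.1 implies that half of the individuals in the cluster were selected.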

For purposes of variance computation, the user
may instruct the program to eliminate one-cluster
strata by selecting the collapse option. If this
option is chosen, a one-cluster stratum is
grouped with the following stratum in the data
set by changing the stratum and cluster
identifications on the involved records. To illustrate
stratum collapse, consider a simple data set
composed of three strata, one of which contains a
single cluster.


Input Data:

Stratum   Cluster   Data
   1         1      Record 1
   1         2      Record 2
   2         1      Record 3
   3         1      Record 4
   3         2      Record 5

The algorithm groups the second stratum, represented by only one
cluster, with the third stratum. The number of data records and
unique [...]

The collapsed data set [...]

Stratum   Cluster   Data
   1         1      Record 1
   1         2      Record 2
   2         1      Record 3
   2         2      Record 4
   2         3      Record 5

[...] records are written to a new "collapsed" data file which is
retained after the user exits the program. If stratum sampling
rates are present, new [...] i, with n(i) = 1 cluster, has been
combined with stratum i+1. These new rates are saved in an
auxiliary rate file. One

can  see  from  the  example  that  different  orderings 
of  the  strata  may  produce  different  collapsed 
data  sets  and  different  collapsed  stratum 
rates.  A  preliminary  pass  through  the  data  is 
necessary  when  either  the  collapse  or  the  two- 
stage option is selected.
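
The collapse rule can be sketched as follows. This is a minimal illustration, not the PC CARP routine: the function name, the consecutive renumbering of strata and clusters, and the handling of a trailing one cluster stratum (merged with the preceding group, since no following stratum exists) are all assumptions made for the example.

```python
def collapse_strata(records):
    """Group each one-cluster stratum with the following stratum.

    records: list of (stratum, cluster, data) tuples in stratum order.
    Returns a new list with relabeled stratum and cluster identifications.
    """
    # Distinct clusters of each stratum, in order of first appearance.
    clusters = {}
    for stratum, cluster, _ in records:
        members = clusters.setdefault(stratum, [])
        if cluster not in members:
            members.append(cluster)

    # A group stays open while it holds a single cluster, so it absorbs
    # the next stratum in the data set.
    groups, current = [], []
    for stratum in clusters:
        current.append(stratum)
        if sum(len(clusters[s]) for s in current) > 1:
            groups.append(current)
            current = []
    if current:
        # Assumption: a trailing one-cluster stratum joins the previous group.
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)

    # Assumption: strata and clusters are renumbered consecutively.
    relabel = {}
    for new_stratum, group in enumerate(groups, start=1):
        new_cluster = 0
        for s in group:
            for c in clusters[s]:
                new_cluster += 1
                relabel[(s, c)] = (new_stratum, new_cluster)
    return [relabel[(s, c)] + (d,) for s, c, d in records]


records = [(1, 1, "Record 1"), (1, 2, "Record 2"), (2, 1, "Record 3"),
           (3, 1, "Record 4"), (3, 2, "Record 5")]
collapsed = collapse_strata(records)
for row in collapsed:
    print(row)
```

Because a one cluster stratum is always merged forward, presenting the same strata in a different order changes which strata end up together, which is the ordering dependence noted in the text.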


III.  Available  Analyses 

Table  1  contains  a  description  of  the  types  of 
statistics  available  to  the  user  and  of  the 
nature  of  the  computations  required  to  obtain  the 
estimates.  A  "Y"  in  the  column  headed  "Cov" 
means  that  the  covariance  matrix  of  a  vector  of 
estimates  of  the  type  listed  on  the  left  can  be 
obtained.  The  design  effect,  denoted  by  DEFF,  is 
available as an option for many of the statistics.
See Kish (1965) for a description of the
design  effect. 

The  population  (Total  and  Ratio)  analyses  and 
stratum analyses are performed in a straightforward
manner. Some details pertaining to
Subpopulation  Analyses,  the  Two-Way  Table  and  the 
Regression  Analysis  are  given  below. 


-  Cell  totals  along  with  marginal  row  and 
column  totals 

-  Conditional  row  proportions  for  each  cell 

-  Conditional  column  proportions  for  each  cell 

-  Cell  proportions  along  with  the  marginal  row 
and  column  proportions. 

Standard  errors  are  computed  for  all  of  the  above 
estimators.  Also,  a  test  statistic  for  the 
hypothesis  of  independence  is  output. 

The  weighted  least  squares  regression  analysis 
computes  coefficient  estimates  and  an  estimated 


Table 1. Analysis capability of PC CARP

Analysis                   Cov   DEFF   Comments

Total Estimation            Y     Y     1 pass; 40 variables
Ratio Estimation            Y     Y     1 pass; 50 variables
                                        without covs. 15 variables
                                        with covs.
Stratum Totals              Y     Y     1 pass; 40 variables
Stratum Means               N     Y     1 pass; 50 variables
Stratum Proportions         N     Y     2 passes; 50 variables
Subpopulation Analyses:
  Totals                    N     Y     Crossed classifications;
  Means                     N     Y     Multiple dependent
  Proportions               N     Y     variables; Multiple passes.
Two-Way Table               N     N     50 cells; 1 pass/dependent;
                                        Tests of Independence
Regression (WLS)            Y     N     2 passes; 35 variables;
                                        Multiple degrees of freedom
                                        hypotheses tests; Residuals
                                        and predicted values

NOTE: Coefficients of Variation are computed for all estimators.


The subpopulation analyses give the user the
option of crossing classification variables.

This  allows  the  user  to  create  new  classification 
structures  from  two  or  more  input  variables.  For 
example,  suppose  the  input  data  includes  the 
classification  variables  age,  sex  and  education 
with  six,  two  and  five  levels,  respectively. 

Then,  by  crossing  age  with  sex  with  education,  a 
new  classification  structure  with  60  levels  is 
produced.  The  user  may  then  obtain  estimates  for 
any  number  of  dependent  variables  under  this 
classification  structure. 
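
The crossing of classification variables can be illustrated with a short sketch. The function name and the numbering of the crossed levels are assumptions for the example; the text does not say how PC CARP numbers the new levels.

```python
from itertools import product


def cross_levels(*level_counts):
    """Map each combination of input levels to one level of the crossed structure."""
    combos = product(*(range(1, n + 1) for n in level_counts))
    return {combo: new_level for new_level, combo in enumerate(combos, start=1)}


# age (6 levels) x sex (2 levels) x education (5 levels)
crossed = cross_levels(6, 2, 5)
print(len(crossed))                              # 60
print(crossed[(1, 1, 1)], crossed[(6, 2, 5)])    # 1 60
```

Each record is then assigned the crossed level matching its (age, sex, education) combination, and estimates are accumulated per crossed level.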

The  Two-Way  Table  analysis  is  defined  by  two 
classification  variables  and  at  least  one 
dependent  variable.  Four  tables  are  then 
computed  for  each  dependent  variable: 


variance-covariance  matrix,  which  takes  into 
account  the  sample  design.  These  calculations 
are given in Fuller (1975) and outlined in
Hidiroglou et al. (1980). Multiple degrees of
freedom F-tests for sets of coefficients and the
usual t-statistics are available. The user also
has the option of obtaining residuals and
estimated true values.

IV.  Program  Details  and  User  Interface 

In  PC  CARP,  certain  tasks  must  be  performed 
repeatedly,  regardless  of  the  analysis.  These 
include data management, error handling and
program  output. 


The  program  relies  on  a  single  data  management 
subroutine  which  performs  the  following 
functions: 

-  Reads  data  from  the  diskfile  and  passes  it 
onto  any  of  the  subroutines  performing  data 
organization  or  analysis. 

- Retrieves and sends rates (stratum or two-
stage)  associated  with  each  data  record. 

-  Manages  the  set  of  files  in  which  the  data 
and  rates  are  stored. 

Isolating  these  functions  in  one  routine  allows 
an analysis routine to be readable and uncluttered
with data management code. Also, this
allows  all  analysis  routines  to  be  structured  in 
a  similar  way. 

In  constructing  the  error  handling  system,  the 
most  important  consideration  was  to  avoid  program 
termination caused by user misspecifications that
could be easily corrected. The system checks
for omitted responses, improper file names and
invalid analysis variable specifications. If
such an error is detected, PC CARP allows the
user to re-enter information or exit the program


unscathed.  The  program  routinely  performs  checks 
to  avoid  computational  errors  such  as  division  by 
zero. 

The  user  has  the  option  of  sending  program 
output  to  any  combination  of  diskfile,  screen  or 
printer.  Within  the  program,  output  is  formed  a 
line at a time. First, the output line is written
to a "buffer", which is actually a character
array. The character array is then sent to a
subroutine which, in turn, routes it to the
proper output device(s). As with data management,
this approach prevents the unnecessary
repetition  of  output  statements. 
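
The buffer-and-route approach can be sketched as below. This is an illustration only: PC CARP is written in FORTRAN, and the function name and the in-memory stand-ins for the screen and diskfile are assumptions made so the example runs anywhere.

```python
import io


def route_line(buffer, devices):
    """Send one formatted output line to every selected output device."""
    for device in devices:
        device.write(buffer + "\n")


# In-memory stand-ins for real devices (hypothetical).
screen, diskfile = io.StringIO(), io.StringIO()
selected = [screen, diskfile]        # user chose screen and disk, not printer

# The line is formed once in a character "buffer" ...
line = "{:<20s}{:>12.3f}".format("Total estimate", 1234.5)
# ... and a single routine routes it to the chosen device(s).
route_line(line, selected)
print(screen.getvalue(), end="")
```

An analysis routine therefore contains one formatting statement per output line, however many devices the user selected.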

Two primary concerns at the program development
stage were to have a friendly user interface
and to minimize the number of passes through the
data. The interface was made user friendly by
implementing an interactive, screen oriented
response  system,  while  a  single  pass  algorithm 
for  variance  estimation  helped  minimize  the 
amount  of  reading  from  data  files. 

When  information  is  needed  by  PC  CARP,  the 
user  receives  a  full  screen  of  short  response 
questions  along  with  detailed  instructions.  The 
first set of screens displayed to the user asks
for  information  pertaining  primarily  to  data 
organization  and  location.  One  such  screen  is 
pictured  below. 


PC  CARP  -  Problem  Specification 

INSTRUCTIONS:  Key  the  problem  identification  and  press  the  ENTER  KEY.  Next 
use  the  "arrow"  keys  on  the  numeric  keypad  to  position  the  cursor.  Key  a 
response  to  each  and  every  requested  item.  Responses  replace  slashes  shown  on 
the  screen.  When  you  have  finished  keying  responses,  press  the  "END"  key  on 
the  numeric  keypad  (lower  right  side  of  keyboard). 

***  ***  *** 

1.  Type  a  problem  name(id)  on  the  next  line,  replacing  the  slashes 

////////////////////////////////////// 

2.  Give  the  total  number  of  variables  input  (Replace  //,  Example  09)  .  .  .  / 

3.  Do  you  wish  to  have  an  intercept  variable  generated?  (Respond  Y  or  N)  .  / 

4.  Do  you  wish  to  enter  the  variable  names  using  the  keyboard? . /

5.  Specify  (Y  or  N)  in  each  case  whether  the  following  items  will  be  input 
with  the  observations.  (Unit  weights  are  the  default) 

a.  STRATUM  ID . / 

b.  CLUSTER  ID . / 

c.  WEIGHT . / 

6.  Do  you  wish  to  enter  the  data  using  the  keyboard?  (Respond  Y  or  N)  .  .  .  / 

7.  Will  stratum  sampling  rates  be  provided?  (Respond  Y  or  N)  Note  that  the

response  to  5a.  above  must  be  "Y"  if  sampling  rates  are  provided  .  /

8.  Is  this  a  two  stage  sample?  (Y  or  N,  if  "Y"  then  response  to  7  is  "Y")  /

9.  Output  from  analysis  is  routed  to  (Y  or  N) 

a.  PRINTER . / 

b.  SCREEN . / 

c.  DISK . /

10.  Do  you  wish  to  collapse  strata  (to  avoid  single  unit  in  a  stratum)  .  .  /

It is representative of the screens in PC CARP in
that  the  user  moves  the  cursor  about  the  screen 
replacing  slashes  (///)  with  Y  or  N,  numeric  or 
short  answer  responses.  Other  screens  will 
appear  for  data  file  information  and  variable 
name  specification.  After  information  pertaining 
to  the  data  is  input,  the  remaining  screens  allow 
the  user  to  choose  their  analysis,  available 
options  and  analysis  variables.  The  particular 
screen  which  appears  for  variable  selection 
depends upon the choice of analysis. For example,
if a subpopulation analysis is specified,
the user must key in classification variables,
upper bounds for the number of levels, crossing
indicators and dependent variables. If ratio
estimation  is  specified,  the  numerator  and 
denominator  variables  are  entered  in  pairs  for 
each  ratio.  The  analysis  specification  and 
variable  selection  screens  will  appear  for  each 
analysis  the  user  wishes  to  perform.  PC  CARP 
produces  the  screens  using  FORTRAN  formatted 
write  statements  while  cursor  movement  and 
response retrieval are supported by assembly
language  routines. 

Up  to  three  different  variance  quantities  can 
be  accumulated  concurrently  for  any  given 
estimator.  These  are  the  first  stage  variance 
component,  the  optional  second  stage  variance 
component  and  the  simple  random  sampling  variance 
used  in  the  computation  of  the  design  effect. 

Each  variance  is  accumulated  using  a  single  pass 
algorithm  for  weighted  means  and  weighted  sums  of 
squares  and  cross  products  matrices.  The 
algorithm is described in Herraman (1968).
Computing all variance quantities in a single
pass through the data requires a large amount of
array space. However, the elimination of
entire passes through the data outweighs the use
of  additional  array  space. 
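
A single pass accumulation of weighted means and a corrected sums of squares and cross products matrix can be sketched as follows. This uses a standard updating scheme and may differ in detail from the Herraman (1968) algorithm the program actually uses; the function name and data are hypothetical.

```python
def weighted_moments(rows, weights):
    """One pass over the data: weighted means and corrected SSCP matrix."""
    p = len(rows[0])
    sum_w = 0.0
    mean = [0.0] * p
    sscp = [[0.0] * p for _ in range(p)]
    for x, w in zip(rows, weights):
        if w == 0:
            continue                      # zero-weight records contribute nothing
        sum_w += w
        # Deviation of the new record from the mean before the update.
        delta = [x[j] - mean[j] for j in range(p)]
        mean = [mean[j] + (w / sum_w) * delta[j] for j in range(p)]
        for j in range(p):
            for k in range(p):
                # Old-mean deviation times new-mean deviation keeps the
                # matrix centered without a second pass.
                sscp[j][k] += w * delta[j] * (x[k] - mean[k])
    return sum_w, mean, sscp


rows = [(1.0, 2.0), (2.0, 4.0), (4.0, 5.0)]
weights = [1.0, 2.0, 1.0]
sum_w, mean, sscp = weighted_moments(rows, weights)
print(sum_w, mean, sscp)
```

The array space mentioned in the text corresponds to keeping one such mean vector and matrix for each variance quantity being accumulated.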
66014. We thank Heon Jin Park for his assistance


THE STATUS OF COMPUTER-ASSISTED TELEPHONE INTERVIEWING


William L. Nicholls II, Bureau of the Census and Robert M. Groves, University of Michigan


Computer-assisted  telephone  interviewing,  or 
CATI, lies on the interface between computer
science  and  statistical  data  collection.  It 
employs  interactive  computing  systems  to  assist 
interviewers  and  their  supervisors  in  performing 
the  basic  data  collection  tasks  of  telephone 
interviewing. This paper: (1) presents a
definition of CATI; (2) reviews its growth and
current status as a new data collection technology;
and (3) summarizes available evidence on
its  consequences  for  survey  interviewing  costs 
and  data  quality. 

1. Definitions

Computer-assisted  telephone  interviewing  is 
part  of  a  broader  family  of  technologies  called 
"computer-assisted  data  collection."  In  addition 
to  CATI,  this  family  includes:  (1)  computer- 
assisted  personal  interviewing  (CAPI)  which 
employs  portable  microcomputers  for  interviews 
in  respondents'  homes  or  offices;  and  (2) 
computerized  self-administered  questionnaires 
(CSAQ)  in  which  similar  equipment  is  operated 
directly  by  respondents.  All  three  technologies 
may  employ  similar  hardware  and  software;  but 
their  data  collection  characteristics  probably 
vary  with  their  usage  and  with  the  settings  in 
which  they  are  employed. 

At  least  in  principle,  CATI  and  CAPI  provide 
interviewers  with  the  same  types  of  online 
interviewing  assistance.  In  state-of-the-art 
CATI  systems: 

a.  The  system  displays  instructions,  survey 
questions,  and  response  categories  on  the 
interviewers'  screens. 

b.  The  screen  may  contain  "fills"  or 
alterations  of  the  display  text  based  on 
prior  answers  or  batch  input. 

c.  Answers  to  closed  questions  may  be  entered 
by  numeric  or  alphabetic  codes.  These  and 
other  numeric  entries  may  be  edited  by  sets 
of  permissible  values,  by  ranges,  or  by 
logical  or  arithmetic  operations. 

d.  Edit  failures  result  in:  (1)  an  unaccepted 
entry  and  error  message  requiring  another 
attempt;  or  (2)  in  display  of  additional 
probes  or  questions  to  be  asked. 

e.  Extended  text  answers  may  be  entered  to 
open-ended  questions. 

f.  Branching  or  skipping  to  the  next  item  is 
automatic  and  may  be  based  on  logical  or 
arithmetic  tests  on  any  prior  entries  or 
input  data. 

g.  Interviewers  may  interrupt  and  resume 
interviews  in  mid-course;  review,  backup 
to,  and  (when  permitted  by  the  survey 
design)  change  prior  entries;  and  enter 
interviewer  notes  at  appropriate  points. 
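
The fills, edits, and automatic branching listed above can be sketched with a toy questionnaire engine. Everything here is hypothetical (the function, the question wording, the dictionary layout); it is not modeled on any particular CATI system, and a list of keyed entries stands in for the interviewer.

```python
def run_interview(questions, entries, start="Q1"):
    """Walk a questionnaire, applying fills, edits, and automatic branching."""
    answers, rejected, transcript = {}, [], []
    entries = iter(entries)
    current = start
    while current is not None:
        question = questions[current]
        # "Fill": the displayed text is altered using prior answers.
        transcript.append(question["text"].format(**answers))
        entry = next(entries)
        if not question["valid"](entry, answers):
            # Edit failure: the entry is not accepted and the item is re-asked.
            rejected.append((current, entry))
            continue
        answers[current] = entry
        # Automatic branching or skipping based on prior entries.
        current = question["next"](answers)
    return answers, rejected, transcript


questions = {
    "Q1": {"text": "How many adults live in this household?",
           "valid": lambda entry, a: 1 <= entry <= 20,          # range edit
           "next": lambda a: "Q2" if a["Q1"] > 1 else None},
    "Q2": {"text": "Of those {Q1} adults, how many are employed?",
           "valid": lambda entry, a: 0 <= entry <= a["Q1"],     # logical edit
           "next": lambda a: None},
}

answers, rejected, transcript = run_interview(questions, [0, 3, 2])
print(answers, rejected)
```

Here the first entry (0) fails the range edit and is rejected, the second is accepted, and the follow-up question is displayed with the household size filled in.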

This paper focuses on computer-assisted telephone
interviewing and even more specifically on
multi-station  CATI  systems.  One-station  CATI 
systems  exist  (Philipp  and  Cicciarella,  1983)  in 
which  each  interviewer  independently  operates  a 
stand-alone  microcomputer  to  complete  online 
interviews  with  assigned  batches  of  cases.  In 
multi-station systems, the interviewing stations
are linked or networked to a common host. This
permits many additional case management features
including: system assignment of cases to interviewers;
shared workloads and system scheduling
of telephone calls and callbacks; online visual
and audio monitoring of interviewers from supervisory
stations; and common record keeping.
While some of CATI's consequences for survey
costs and data quality derive from online interviewing
features shared with CAPI and one-station
CATI systems, others result from the
case management and supervisory capabilities of
multi-station systems.

A  CATI  system  may  provide  the  equivalent  of  a 
blank  questionnaire  on  which  the  interviewer 
enters  a  case  number  and  other  input  data  before 
placing  a  call.  More  commonly  the  system 
contains  a  file  of  sample  cases  identified  by 
case  numbers.  The  interviewer  may  access  a  case 
in  one  of  two  ways:  (1)  by  case  call-up,  where 
the  interviewer  enters  the  identification  number 
of  a  selected  case;  or  (2)  by  online  case 
assignment and call scheduling, where the interviewer
indicates readiness for an interview and
the system supplies a case appropriately called
at that time. The latter reduces the need for interviewer
maintenance and review of paper or
displayed  listings  in  choosing  cases  to  call. 
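
Online case assignment can be sketched as a priority queue keyed on the time each case should next be called. This is a simplified illustration under assumed rules (earliest scheduled time first, times in minutes from the start of the shift); the class and method names are hypothetical, and real schedulers apply many more criteria.

```python
import heapq


class CallScheduler:
    """Supplies the next case that is due when an interviewer is ready."""

    def __init__(self):
        self._queue = []                  # heap of (next_call_time, case_id)

    def schedule(self, case_id, next_call_time):
        heapq.heappush(self._queue, (next_call_time, case_id))

    def next_case(self, now):
        """Return a case due at or before `now`, or None if none is due."""
        if self._queue and self._queue[0][0] <= now:
            return heapq.heappop(self._queue)[1]
        return None


scheduler = CallScheduler()
scheduler.schedule(101, 0)      # previously uncalled case, callable at once
scheduler.schedule(102, 0)
scheduler.schedule(103, 90)     # callback appointment 90 minutes into the shift

print(scheduler.next_case(now=10))   # 101
print(scheduler.next_case(now=10))   # 102
print(scheduler.next_case(now=10))   # None
print(scheduler.next_case(now=95))   # 103
```

The interviewer only signals readiness; the choice of case, which under case call-up would require scanning listings, is made by the system.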

Current  CATI  systems  differ  greatly  in  their 
use  of  online  call  scheduling.  Some  lack  this 
capability,  some  limit  its  use  to  previously 
uncalled  cases,  and  some  employ  it  only  up  to 
the  point  where  the  sample  household  or  office 
is  reached.  Others  use  it  for  virtually  all 
calls  except  problem  callbacks,  such  as  recalls 
to  initial  refusals  and  missed  appointments. 
The  range  of  circumstances  to  which  online  call 
scheduling is applied should affect both interviewer
productivity levels and measures of data
quality dependent on the frequency and efficiency
of calling.

2. History and Status

Market  research  agencies  in  the  private 
sector created the first computer-assisted telephone
interviewing systems and established
initial expectations of CATI's data collection
characteristics.  Based  on  experiences  in  the 
first  CATI  survey,  conducted  by  Chilton  Research 
for  AT&T  in  1971,  Nelson,  Peyton,  and  Bortner 
(1972)  described  "three  distinct  advantages"  for 
cathode  ray  tube  interviewing  (as  it  was  then 
called)  in  comparison  with  conventional  data 
collection  methods.  These  were:  "accuracy, 
speed,  and  reduced  costs." 

Throughout the 1970's and early 1980's, the
number,  size,  and  sophistication  of  commercial 
market  research  CATI  systems  increased  rapidly 
(Dutka  and  Frankel,  1980;  Fink,  1983;  Smith  and 
Smith,  1980).  Single  installations  of  100  or 
more  stations  now  exist,  as  do  networks  of 
geographically dispersed field sites. At least
100 CATI installations are operated by
commercial market research agencies. The
majority are in the United States; but market
research firms using CATI are also found in
Australia,  Canada,  Italy,  Great  Britain,  the 
Netherlands,  New  Zealand,  Sweden,  Switzerland, 
and  West  Germany. 

University  research  centers  began  their 
largely  independent  development  of  CATI  five 
years  after  its  introduction  in  market  research. 
The  UCLA  Center  for  Computer-Based  Behavior 
Sciences  led  the  way;  and  the  Center's  Director, 
Gerald  Shure,  coined  the  CATI  name  and  acronym 
(Shure  and  Meeker,  1978).  Development  work  at 
the  Berkeley,  Michigan,  and  UCLA  Survey  Research 
Centers  and  the  Wisconsin  Survey  Research 
Laboratory  followed  shortly  thereafter  (Groves, 
1983; Palit and Sharp, 1983; Shanks et al.,
1980). Academic survey research centers greatly
expanded  the  range  of  CATI  capabilities, 
especially  for  probability  (rather  than  quota 
control)  sampling,  call  scheduling  and  callback 
routines  necessary  for  high  response  rates,  and 
greater  freedom  of  interviewer  movement.  Today 
at  least  18  university  and  private  research 
organizations, such as RAND, RTI, Westat, and
Mathematica,  employ  CATI. 

U.S.  governmental  agencies  demonstrated  an 
early  interest  in  CATI,  but  steps  to  acquire 
their  own  CATI  capabilities  did  not  begin  until 
1980 when the U.S. Department of Agriculture
Statistical  Reporting  Service  and  U.S.  Census 
Bureau  both  established  internal  staffs  for  this 
purpose  (House,  1984;  Nicholls,  1983).  Both 
completed  their  first  tests  of  CATI  in  1982  and 
have  continued  CATI  testing  and  production  data 
collection  since  that  time.  In  the  Netherlands, 
the  Central  Bureau  of  Statistics  began  CATI  data 
collection  for  continuing  surveys  in  1984. 

A  major  expansion  of  governmental  CATI 
installations began in 1985. The USDA Statistical
Reporting Service started placement of
production  CATI  facilities  in  state  offices  with 
the  goal  of  completing  this  task  by  January 
1989.  The  Centers  for  Disease  Control  began 
installing  CATI  in  25  state  offices  (most  of 
these  two-station  sites)  for  use  in  a  continuing 
survey.  The  Bureau  of  Labor  Statistics  began 
testing  CATI  production  on  three  surveys.  The 
U.S.  Census  Bureau  opened  a  40-station  CATI 
facility  for  production  data  collection  and 
expanded  programs  of  testing  on  demographic 
surveys while its Business Division began automating
selected economic surveys with a system
including  CATI  functions.  The  Census  Bureau  is 
currently  developing  plans  for  full  CATI  and 
CAPI  implementation  on  the  Current  Population 
Survey  and  the  National  Crime  Survey.  If 
further  evaluative  tests  prove  successful  and 
the surveys' sponsoring agencies approve, this
major change in data collection methods should
be completed in the 1990's. Outside the U.S.,
Statistics Sweden began procurement in 1985 for
an integrated prototype CATI/CAPI system for its
major surveys; and this year Statistics Canada
will begin testing CATI in its Labor Force
Survey.

Applications  to  governmental  data  collection 
have  placed  new  demands  on  the  design  of  CATI 
surveys.  A  common  government  application  is  in 
telephone  follow-up  to  mail  nonresponse.  This 
has  required  systematization  of  procedures  for 
telephone tracing of difficult to reach respondents
for inclusion in CATI interviewing and
automatic call scheduling systems (Ferrari, 1986
and Nicholls, 1983). A related application with
special promise for longitudinal demographic
surveys is use of CATI for second- and later-visit
interviews after an initial interview in
person.  In  such  designs,  personal  visits 
(eventually  by  CAPI)  may  continue  for  households 
unreachable  by  telephone.  Other  potential  uses 
by  governmental  agencies  include  failed-edit 
follow-up  and  reconciliation  reinterviewing. 

3. CATI's Data Collection Characteristics

Survey  organizations  considering  adoption  of 
CATI  frequently  ask  the  following  questions: 
(1)  How  much  will  CATI  cost  to  install  and 
operate?;  (2)  How  will  CATI  change  the  time 
required  to  design  and  complete  a  survey?;  and 
(3)  What  effects  will  CATI  have  on  data  quality 
compared  with  conventional  methods? 

In  a  previous  paper,  the  authors  (Nicholls 
and  Groves,  1985)  attempted  to  summarize  the 
existing  literature  on  CATI  to  determine  how 
fully  each  of  these  questions  could  be  answered. 
Our general conclusion was that the past literature
provided few firm answers about CATI's
effects  on  survey  costs,  timeliness,  or  data 
quality.  This  paper  will  take  a  somewhat  more 
encouraging  stance.  Published  and  unpublished 
research  released  during  the  last  year  is 
contributing  to  a  better  (although  still  far 
from  complete)  understanding  of  at  least  some  of 
CATI's  data  collection  characteristics.  These 
include  key  factors  in  interviewing  costs  and 
selected  consequences  for  data  quality.  The 
remainder  of  this  paper  will  focus  on  these 
areas.

By  most  standards,  this  evidence  is  still 
relatively  weak.  Four  studies  have  been 
published  which  are  (or  closely  approximate) 
controlled  experiments  in  which  probability 
subsamples  of  the  same  survey  are  interviewed  by 
CATI  and  paper  methods  at  the  same  time  by  the 
same  staff  under  controlled  conditions.  These 
are  the  SRC-Michigan  RDD  Health  Survey  Test 
(Groves  and  Mathiowetz,  1984);  the  USDA 
California  Dual  Frame  Cattle  Inventory  Survey 
(House,  1984  and  Tortora,  1985);  the  USDA 
Nebraska  Hog  Survey  (Coulter,  1985);  and  the 
Westat  Florida  Colo-Rectal  Cancer  Kin  Survey 
(Harlow et al., 1985). Most of these
experimental  studies  employed  relatively  small 
samples,  ranging  from  about  130  to  1,200  CATI 
cases  and  as  few  as  four  or  five  CATI  stations. 
All  also  represent  relatively  early  use  of  CATI 
by  their  organizations.  Comparable  information 
from organizations with at least three years of
CATI production experience is generally available
only in the form of summary impressions
rather than quantitative data (Palit and Sharp,
1984).

Comparisons  of  CATI  and  paper-and-pencil  data 
collection are also available from four comparative
studies which do not meet the requirements
of fully controlled experiments. The SRC-UCLA
Earthquake Survey (Fielder, 1985) and the SRC-
Berkeley Malignant Melanoma Survey (Coleman,
1985) approximate before-and-after designs.
Earlier waves of these surveys were conducted by
paper methods and later waves by CATI. The
comparisons are limited to first-visit interviews
in repeated cross-sections or control
samples.  The  U.S.  Census  Bureau  also  completed 
tests  of  CATI  for  telephone  follow-up  to  mail 
nonresponse  in  the  National  Survey  of  Scientists 
and  Engineers  (Ferrari,  1984)  and  the  1982 
Census  of  Agriculture  (Ferrari,  1986).  Each 
test  assigned  probability  subsamples  of  7,000  or 
more  cases  to  each  treatment,  but  the  CATI  and 
non-CATI  staffs  worked  at  widely  separated  sites 
which  followed  different  hiring,  supervisory, 
and  management  procedures.  These  uncontrolled 
factors and difficulties encountered in recovering
all field work records from the non-CATI
site require caution in interpreting their
results.

While  each  of  these  studies  is  limited  in 
sample  size,  design,  or  experience  with  CATI, 
they have the collective strength of representing
largely independent efforts by seven
different  investigators  in  six  different 
organizations  utilizing  five  different  CATI 
systems.  Where  consistent  results  are  found, 
they  suggest  generalizations  which  may  apply 
across  varying  organizational  settings. 

4. Data Collection Costs

The  total  costs  of  CATI  data  collection  will 
depend  on  many  factors,  including  the  costs  of: 
(1)  hardware  and  software  acquisition, 
installation,  and  maintenance;  (2)  CATI  survey 
design, setup, and debugging; (3) interviewing;
and (4) data preparation. Summary
impressions  have  been  published  which  suggest 
that  total  costs  of  a  CATI  survey  will  be  less 
than  those  of  a  comparable  survey  conducted  by 
paper-and-pencil  methods,  but  these  impressions 
have  not  been  accompanied  by  supporting  detailed 
evidence (Nelson et al., 1972; Palit and Sharp,
1983). 

Quantitative  evidence  is  available  bearing  on 
only  one  cost  component,  interviewing  costs. 
Telephone  interviewing  may  be  divided  into  three 
main tasks: (1) placing calls to reach designated
respondents; (2) interviewing respondents
when reached; and (3) post-interview clerical
tasks,  such  as  editing  completed  forms  for 
consistency,  maintaining  records  of  calls,  and 
transcribing  data  between  forms.  Research  on 
CATI  has  most  frequently  focussed  on  interview 
length  once  respondents  have  been  reached. 

The  results  of  three  experimental  and  two 
comparative  studies  are  summarized  in  Table  1. 
House  (1984)  found  the  mean  length  of  CATI  and 
non-CATI  interviews  to  be  equal  but  reported 
problems  of  obtaining  comparable  timings  across 
modes.  The  remaining  four  studies  found  CATI 
interviews  to  be  longer.  Groves  and  Mathiowetz 
(1984) and Harlow et al. (1985) reported CATI
interviews  about  13-14  percent  longer  in  surveys 
where  the  telephone  was  the  primary  mode  of  data 
collection.  Coleman  (1985)  found  CATI 
interviews  22  percent  longer  but  part  of  the 
difference  was  attributable  to  added  questions 
in  the  CATI  questionnaire.  Ferrari  (1984) 
reports  CATI  interviews  50  percent  longer  in  one 
use  of  telephone  interviews  for  follow-up  to 
mail  nonresponse,  but  the  CATI  timings  included 
additional  activities  and  also  were  accompanied 
by  substantially  lower  rates  of  item 
nonresponse. While some uncertainties of
measurement are present in all five studies,
collectively  they  suggest  that  CATI  interviews 
tend  to  be  at  least  somewhat  longer  than 
comparable  paper-and-pencil  interviews. 
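
The percentage differences quoted above can be checked against the mean interview lengths reported in Table 1. The short sketch below does only that arithmetic; the function name and dictionary labels are hypothetical, and the lengths are the published means in minutes.

```python
def percent_longer(cati, non_cati):
    """Relative difference of the CATI mean length over the non-CATI mean, in percent."""
    return 100.0 * (cati - non_cati) / non_cati


# Mean minutes per completed interview: (CATI, non-CATI), taken from Table 1.
lengths = {
    "House (1984)": (8.2, 8.2),
    "Groves and Mathiowetz (1984)": (52.0, 46.0),
    "Harlow et al. (1985)": (28.5, 25.1),
    "Coleman (1985)": (10.9, 8.9),
    "Ferrari (1984)": (20.8, 13.7),
}

for study, (cati, non_cati) in lengths.items():
    print(f"{study}: {percent_longer(cati, non_cati):.1f} percent longer")
```

The computed differences agree with the rounded figures cited in the text: no difference for House, roughly 13 to 14 percent for the two telephone-primary surveys, about 22 percent for Coleman, and about 50 percent for Ferrari.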

TABLE 1

MEAN LENGTH OF COMPLETED INTERVIEW
IN MINUTES BY MODE IN FIVE SURVEYS

Survey                                      CATI     Non-CATI

USDA Cattle Multiple Frame
Survey (House, 1984)                         8.2        8.2

SRC-Michigan National Health
Survey Test (Groves and
Mathiowetz, 1984)                           52         46

Westat Colo-Rectal Cancer Survey
(Harlow et al., 1985)                       28.5       25.1

Malignant Melanoma Study
(Coleman, 1985)                             10.9        8.9

U.S. Census Bureau Survey of
Scientists and Engineers Telephone
Follow-Up (Ferrari, 1984)                   20.8       13.7

Although  CATI  relieves  the  interviewers  of 
the  task  of  turning  pages  and  of  finding  the 
next  question  to  ask,  several  hypotheses,  none 
confirmed,  have  been  advanced  to  explain  the 
apparently longer length of CATI interviews.
First, experienced paper-and-pencil interviewers
often begin asking the next question
while recording the last. With CATI, this is
more difficult because the next question often
is not displayed until the answer to the prior
question is entered. Second, entering responses
to open-ended questions may take longer in CATI
because most interviewers write somewhat faster
than they type (Groves and Mathiowetz, 1984).
Third, to the extent that CATI ensures completion
of items, probes, or other interviewing
tasks occasionally missed with paper-and-pencil
methods, this also will lengthen CATI interviews.
Further increases in length will occur
when the CATI questionnaire includes edit checks
requiring added probes or other interviewer
actions to reconcile apparent inconsistencies
(Morton  and  House,  1983). 

Although  CATI  interviews  tend  to  be  longer, 
CATI  interviewers  may  spend  less  time  between 
interviews. With efficient online call scheduling
and case assignment, CATI systems should
reduce interviewer time selecting cases to call
and maintaining call records. Automatic
branching between items, online editing, and the
recording of entries directly on computer files
should reduce the need for post-interview
clerical review of completed forms and transcribing
data between forms. The net effect may
be an increase in interviewer productivity.
Nelson et al. (1972) reported a 10 percent
increase in interviewer productivity in the
first CATI survey conducted in 1971.
Palit and Sharp (1983) reported a 20 percent
increase in interviewer productivity (measured
in sample points contacted per production hour)
compared  with  paper  methods  in  random  digit 
dialing  telephone  surveys.  Neither  paper 
describes  the  methods  by  which  the  comparisons 
were  made  nor  presents  supporting  data. 

Estimates  from  two  recent  studies,  shown  in 
Table  2,  suggest  that  the  productivity  of  CATI 
interviewers  may  depend  on  the  use  of  online 
call  scheduling  and  case  assignment.  Using  a 
CATI  system  without  these  features,  Coulter 
(1985)  reported  CATI  interviewers  12  percent 
less  productive  than  paper-and-pencil 
interviewers  when  productivity  was  measured  by 
the combination of completed and refused interviews
per hour. By contrast, in the Census of
Agriculture  telephone  follow-up  which  made 
extensive  use  of  online  call  scheduling  and  case 
assignment,  Ferrari  (1986)  found  CATI 
interviewers  45  percent  more  productive  than 
paper-and-pencil  interviewers  when  measured  by 
completed  interviews  per  paid  interviewer  hour. 
Productivity  appears  to  have  been  increased  by 
reducing  the  time  spent  between  interviews.  The 
CATI  interviewers  placed  23  percent  more  calls 
per  hour  and  spent  31  percent  more  time  on  the 
phone  than  the  paper-and-pencil  interviewers. 


TABLE 2

ESTIMATES OF INTERVIEWER PRODUCTIVITY BY MODE
IN TWO SURVEYS

Survey and Measure                   CATI   Non-CATI

USDA Nebraska Hog Survey
(Coulter, 1985)
  Completions and refusals
    per interviewer hour.......       5.3       6.0
  Sample size in cases.........     (550)     (575)

Census of Agriculture Telephone
Follow-Up (Ferrari, 1986)
  Completed interviews per
    paid interviewing hour.....      1.06      0.73
  Telephone calls per paid
    interviewing hour..........     10.42      8.45
  Minutes phoning per paid
    interviewing hour..........     49.39     37.77
  Sample size in cases.........   (7,688)   (5,427)
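The percentage comparisons quoted in the text can be rechecked from the Table 2 figures. The short Python sketch below does so; the dictionary layout and the function name `percent_difference` are illustrative choices and do not come from the cited studies.

```python
# Figures transcribed from Table 2 as (CATI, non-CATI) pairs.
TABLE_2 = {
    "Hog Survey: completions and refusals per interviewer hour": (5.3, 6.0),
    "Census of Agriculture: completed interviews per paid hour": (1.06, 0.73),
    "Census of Agriculture: telephone calls per paid hour": (10.42, 8.45),
    "Census of Agriculture: minutes phoning per paid hour": (49.39, 37.77),
}


def percent_difference(cati: float, non_cati: float) -> float:
    """Percent by which the CATI figure exceeds (+) or trails (-) the non-CATI figure."""
    return 100.0 * (cati - non_cati) / non_cati


for measure, (cati, non_cati) in TABLE_2.items():
    print(f"{measure}: {percent_difference(cati, non_cati):+.0f}%")
```

Rounded to whole percents, the four differences agree with the 12, 45, 23, and 31 percent figures cited in the text.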

The increased productivity reported for CATI
by Ferrari is inflated by CATI's higher response
rate and perhaps by: (1) failures of the
paper-and-pencil interviewers to accurately record the
number, timings, and outcomes of calls; (2)
analysis methods necessary to circumvent the
loss of many paper call records; and (3)
previously mentioned uncontrolled factors in
this comparative study. Nevertheless, the
direction of the results is consistent with the
summary impressions of agencies which conduct
CATI surveys.

Interviewer productivity is only one
component  of  total  interviewing  costs.  The 
initial  training  of  CATI  interviewers  is  often 
believed  to  take  longer  since  they  must  learn  to 
operate  a  computer  terminal  or  microcomputer. 
CATI's  consequences  for  interviewing  supervision 
are  less  clear  and  may  depend  on  the  tasks  that 
supervisors  are  assigned.  Initial  supervisory 
training  may  require  more  time  for  CATI,  but  a 
system with online call scheduling, case assignment,
and automatic record keeping should free
the supervisors of most clerical and report
preparation tasks and eliminate the need for
clerical support. This may provide more time
for direct supervision and monitoring of
interviewers.

To  date,  only  one  study  has  attempted  to 
include  these  elements  in  cost  comparisons  of 
CATI and non-CATI data collection. Table 3
presents cost projections by Ferrari (1986) based
on comparative data from the Census of Agriculture
but adjusted for the differing pay rates
and supervisory practices at the CATI and
non-CATI sites. Data entry salaries also are
included since CATI data entry occurs simultaneously
with interviewing. Searching telephone
directories  for  respondents'  numbers  when 
unknown  and  agricultural  analyst  review  of 
completed  interviews  for  content  consistency  and 
completeness  were  assumed  to  require  the  same 
cost  per  case  in  both  methods. 


TABLE 3

PROJECTED INTERVIEWING AND KEYING SALARY COSTS
PER CASE: CENSUS OF AGRICULTURE*

Activity                        CATI   Non-CATI

Interviewer training.....      $1.09     $ .39
Interviewing.............       2.48      2.07
Interviewer supervision..        .77       .44
Clerical support.........         --       .18
Tel. number research.....        .11       .11
Data keying..............         --      1.26
Analyst review...........        .08       .08

Total per case...........      $4.53     $4.53
Total per complete.......      $8.83     $9.79

*Ferrari, 1986.


In  Ferrari's  analysis,  CATI  has  higher  costs 
per  assigned  case  in  three  areas:  interviewer 
training,  interviewing,  and  interviewer  super¬ 
vision.  The  higher  supervisory  costs  include 
both  added  training  for  the  supervisors  and  a 
higher  supervisor-to-employee  ratio  than  used 
previously  in  this  survey.  At  the  same  time, 
CATI  achieves  savings  by  eliminating  clerical 
staff  and  data  keying.  When  summed,  total 
salary  costs  for  CATI  and  paper  methods  are 
equal  per  assigned  case.  However,  since  CATI 
obtained  a  higher  response  rate  in  this  study, 
CATI's total salary costs per completed
interview were less, by 11 percent.

These projections are based on the use of
CATI for telephone follow-up to mail nonresponse
and may not apply to other applications or
organizational settings. However, the analysis
suggests the types of data required to begin
assessments of CATI's cost-effectiveness as a
data collection method. Future analyses should
include the salaries of professional and
technical staff in survey design and in
processing as well as nonsalary costs, such as
amortization of CATI and key entry hardware and
duplication of paper forms.
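The arithmetic behind Table 3 can be retraced with a few lines of Python. This is only a sketch: the activity keys are shorthand introduced here, activities with no salary cost for a mode are entered as 0.0 (an assumption about the dashes in the table), and the "implied completion share" is a derived quantity, not a figure reported by Ferrari.

```python
# Salary cost per assigned case in dollars, transcribed from Table 3
# as (CATI, non-CATI). A dash or blank in the table is entered as 0.0.
COSTS = {
    "interviewer training": (1.09, 0.39),
    "interviewing": (2.48, 2.07),
    "interviewer supervision": (0.77, 0.44),
    "clerical support": (0.0, 0.18),
    "telephone number research": (0.11, 0.11),
    "data keying": (0.0, 1.26),
    "analyst review": (0.08, 0.08),
}
TOTAL_PER_COMPLETE = (8.83, 9.79)  # reported cost per completed interview


def column_total(index: int) -> float:
    """Sum one column of COSTS (0 = CATI, 1 = non-CATI), rounded to cents."""
    return round(sum(pair[index] for pair in COSTS.values()), 2)


per_case = (column_total(0), column_total(1))

# Cost per case divided by cost per complete gives the share of assigned
# cases that ended as completed interviews under each mode.
implied_completion = tuple(
    case / complete for case, complete in zip(per_case, TOTAL_PER_COMPLETE)
)

print("Total per case (CATI, non-CATI):", per_case)
print("Implied completion share:", [round(s, 3) for s in implied_completion])
```

Both columns sum to $4.53 per assigned case, matching the table. The gap between the two per-complete figures, $0.96, is about 10 percent of the non-CATI cost and about 11 percent of the CATI cost.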


One of the most common speculations of the
small literature on CATI is that CATI will
improve the quality of data collected in telephone
surveys (Groves, 1983; House, 1984; Nelson
et al., 1972; Nicholls, 1978; Rustemeyer et al.,
1978; Shanks, 1983). Others have suggested ways
in which CATI may lower data quality (Harlow et
al., 1985; Presser, 1983). We will look at
available evidence in four areas: (1) unit
nonresponse, (2) item nonresponse, (3) data
consistency, and (4) the recording of textual
material.

5.1 Unit Nonresponse

CATI can affect survey nonresponse through
the interviewers' and respondents' reactions to
this new medium or through its special features,
such as online call scheduling and case assignment.
The small but consistent literature on
interviewer and respondent reactions suggests:
(1) that interviewers either prefer CATI to
paper-and-pencil methods or are about evenly
divided in their preferences between these two
modes; and (2) that respondents accept CATI as
well as other forms of telephone interviewing or
are unaware of the interviewing mode employed
(Coulter, 1985; Groves and Mathiowetz, 1984;
Morton and House, 1984; Nicholls, 1978). While
these reactions may change with time or special
circumstances, both interviewers and respondents
appear to regard CATI as an acceptable method of
telephone data collection.

There is little reason to anticipate,
therefore, that interviewers' or respondents'
reactions will affect survey response rates.
Coleman (1985) reports identical response rates
for CATI and non-CATI treatments in the
Malignant Melanoma Survey. Similarly, Groves
and Mathiowetz (1984) found nearly identical
response rates in two of three replicates of the
RDD Health Survey Test. A statistically significant
difference occurred only in the first
replicate, during a period when the CATI system
could not maintain acceptable response times
between questions. Under these circumstances,
interviewer or respondent reactions to CATI may
have lowered the response rate.

Other investigators have focused specifically
on refusal rates, that is, the percent of
contacted households who refuse an interview.
Of the five experimental and comparative studies
summarized in Table 4, four found no difference
between CATI and non-CATI refusal rates. In the
Census of Agriculture, where telephoning was
used for follow-up to mail nonresponse, the CATI
staff obtained a significantly lower refusal
rate. This may be the result of the additional
calls and time on the telephone made possible
for the CATI staff by automatic call scheduling
and case assignment. The CATI staff placed 76
percent more calls and averaged twice as much
time on the phone with cases finally classified
as refusals. The difference also may be attributable
to the previously described uncontrolled
factors in this comparative study.


TABLE 4

REFUSAL RATES BY MODE IN FIVE SURVEYS


Survey 


CATI  Non-CATI 


Colo-Rectal  Cancer  Survey 

(Harlow  et  al. ,  1985) . 

Nebraska  Hog  Survey 

(Coulter,  1985) . 

Cattle  Inventory  Survey 

(House,  1984) . 

Survey  of  Scientists  and 
Engineers  (Ferrari,  1984). 
Census  of  Agriculture 

(Ferrari,  1986) . 


21.7%  21.2% 


5 . 2%*  12.5% 


*Statistically  significant  difference  between 
CATI  and  non-CATI  at  the  .05  level. 

A second major component of unit nonresponse
is failure to reach sampled respondents because
their telephone numbers cannot be found or their
numbers are not answered when called. Table 5
compares CATI and non-CATI contact rates, the
percent of assigned cases whose households were
reached, whether that contact resulted in an
interview, a refusal, other noninterview, or a
determination of ineligibility. In telephone
follow-up to mail nonresponse in the Survey of
Scientists and Engineers and in the Census of
Agriculture, current telephone numbers and
addresses frequently were not available; and the
interviewing assignments included tracing such
respondents through directory assistance and
other sources. The CATI system included these
tracing steps in its call scheduling and case
assignment procedures. The paper-and-pencil
interviewers were given guidelines with the same
procedures but independently selected cases to
call within batched assignments. In both
surveys, the CATI telephone follow-up staff
obtained a significantly higher contact rate
than the non-CATI staff. While online tracing
and call scheduling appear to have produced
this result, uncontrolled factors in these
comparative studies also may have contributed.


TABLE 5

CONTACT RATES BY MODE IN THREE SURVEYS

Survey                              CATI   Non-CATI

Survey of Scientists and
  Engineers (Ferrari, 1984)...     50.2%*    44.4%
Census of Agriculture
  (Ferrari, 1986).............     84.3%*    79.0%
Cattle Dual Frame Survey
  (House, 1984)...............       72%*      57%

*Statistically significant difference between
CATI and non-CATI at the .05 level.

House (1984) reports a similar result for a
CATI system without online call scheduling. In
this survey, however, CATI stations were in
short supply, and to make maximum use of the
available equipment, a supervisor stood behind
the four CATI interviewers and chose cases for
them to call. The non-CATI interviewers worked
independently. Online call scheduling may be
viewed as automation of such manual supervisory
support.


5.2 Item Nonresponse


Item missing data arise both from interviewer
failure to ask questions or enter responses and
from respondent failure to provide substantive
answers. One of the most frequently cited
advantages of CATI is rigid control over question
flow and recording of responses, forcing
the interviewer through each question appropriate
to the respondent and requiring an entry
at each question displayed. In principle, this
feature can eliminate errors from interviewers
inadvertently or intentionally skipping items.
While it is possible to prevent interviewers
from entering "don't know" responses by limiting
acceptable entries, in practice interviewers
generally are permitted to enter "refused" or
"don't know" to any question, just as in
paper-and-pencil interviewing. Forced entry at each
question does not ensure recording of a
substantively meaningful value.

Three studies comparing item nonresponse by
mode are summarized in Table 6. Groves and
Mathiowetz (1984) found the same levels of item
nonresponse for CATI and paper interviews on six
demographic and income items. Fielder (1985)
reports smaller levels of item nonresponse for
CATI, although none of the differences reach
common levels of statistical significance. In
these two studies, which employed the telephone
as the primary data collection mode, CATI seems
to have little or no effect on item nonresponse.

The consequences of CATI for item nonresponse,
however, may depend on the content of
the items, the design of the questionnaire in
each mode, and the training and supervision of
the interviewers. In cases where the number of
items asked is partly left to the interviewers'
judgment, CATI may produce large reductions in
item nonresponse. This can occur in telephone
follow-up to mail nonresponse where the interviewer
must reconcile the conflicting goals of
obtaining as much information as possible and
not antagonizing possibly reluctant respondents.
The first test of the U.S. Census Bureau's CATI
system for the Survey of Scientists and
Engineers was this type of application. As
shown in Table 6, the CATI staff obtained
substantially lower rates of item nonresponse than
the comparison paper-and-pencil staff. The
greater difficulty of omitting applicable
questions in CATI apparently contributed to this
difference, but since the results are not based
on a fully controlled experimental design they
must remain only suggestive.


TABLE 6

MEAN PERCENT ITEM NONRESPONSE BY TOPIC AND MODE
IN ONE EXPERIMENTAL AND TWO COMPARATIVE STUDIES

Survey and Topic                   CATI    Non-CATI

Health Survey Test (Groves
and Mathiowetz, 1984)
  Sex, age, education......        0.7%       0.9%
  Race, income, marital....       14.6%      15.0%
  Sample size..............       (942)    (1,137)

SRC-UCLA Earthquake Survey
(Fielder, 1985)
  15 Demographic items.....        0.3%       0.5%
  11 Opinion items.........        2.3%       5.3%
  Sample size..............       (536)      (516)

Survey of Scientists and
Engineers (Ferrari, 1984)
  14 Items asked of all
    respondents**..........        7.1%*     24.6%
  12 Items asked of most
    respondents**..........        7.5%*     26.6%
  Sample size..............     (3,056)   (16,159)

*Statistically significant difference between
CATI and non-CATI at the .05 level.

**Excludes imputed items and those constructed
during post-interview computer edits.


5.3 Data Consistency


The  data  for  a  case  may  be  described  as 
"consistent"  if  the  entered  values  do  not 
contradict  one  another.  Consistency  may  be 
limited  to  the  responses  to  one  interview  or 
extended  across  successive  interviews  and  to 
prior  information  from  records  and  other 
sources.  Paper-and-pencil  methods  strive  for 
data  consistency  by  asking  the  interviewer  to 
probe  obviously  inconsistent  responses,  by 
supervisory  or  clerical  review  of  completed 
forms,  by  batch  computer  editing  after  the 
interview  is  keyed,  and  by  reinterviewing  cases 
which  fail  these  edits.  Data  consistency 
ensures  neither  validity  nor  reliability  but  is 
generally  regarded  as  a  useful  measure  of  data 
quality. 

Computer-assisted  telephone  interviewing  may 
contribute  to  data  consistency  in  two  primary 
ways.  The  first  is  through  automatic  branching 
between  items  to  ensure  that  all  applicable 
questions  are  asked,  or  at  least  displayed  for 
the interviewer. (Inapplicable questions are
omitted or appropriately marked if previously
asked and later found to be inapplicable.) The
second is through online editing, in which
apparently inconsistent responses require additional
actions by the interviewer, such as
backing up to correct prior entries or making an
additional entry to explain the reason for the
inconsistency.
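The two mechanisms just described, automatic branching and online editing, can be illustrated with a minimal Python sketch. The three-item instrument, the item names, and the single edit rule are hypothetical and stand in for whatever a real CATI system would define; the sketch does not reproduce any of the systems cited here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Answers = Dict[str, object]


def always(answers: Answers) -> bool:
    return True


def reports_cattle(answers: Answers) -> bool:
    return answers.get("has_cattle") == "yes"


# Hypothetical instrument: (item name, rule saying when the item applies).
QUESTIONS: List[Tuple[str, Callable[[Answers], bool]]] = [
    ("has_cattle", always),
    ("total_head", reports_cattle),
    ("calves", reports_cattle),
]


def next_item(answers: Answers) -> Optional[str]:
    """Automatic branching: first applicable item not yet answered."""
    for name, applies in QUESTIONS:
        if name not in answers and applies(answers):
            return name
    return None


def online_edit(answers: Answers) -> List[str]:
    """Online editing: describe apparent contradictions among the entries."""
    problems = []
    total, calves = answers.get("total_head"), answers.get("calves")
    if isinstance(total, int) and isinstance(calves, int) and calves > total:
        problems.append("calves exceeds total_head")
    return problems


def interview(entries: Answers) -> Answers:
    """Replay scripted entries in the order the branching rules dictate."""
    answers: Answers = {}
    name = next_item(answers)
    while name is not None:
        answers[name] = entries[name]
        name = next_item(answers)
    return answers


skipped = interview({"has_cattle": "no"})
flagged = interview({"has_cattle": "yes", "total_head": 40, "calves": 55})
print(skipped)               # only has_cattle is asked; the other items are skipped
print(online_edit(flagged))  # the calves entry is flagged for correction or explanation
```

In a live system the flagged entry would hold the interviewer at that point until the value is corrected or an explanatory note is entered; the sketch only reports the problem.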

Two studies with controlled experimental
designs have provided first evidence of CATI's
effects on data consistency. Groves and
Mathiowetz (1984) analyzed a sequence of 28
questions with complex skip patterns and found
that 8.8 percent of the entries by the
paper-and-pencil interviewers had consistency
errors compared with 1.8 percent of the entries
by CATI interviewers. Contributions of CATI to
data  consistency  through  online  editing  have 
been  reported  by  Tortora  (1985)  in  the  Cattle 
Inventory  Survey.  When  data  sets  from  CATI  and 
non-CATI  interviews  were  submitted  to  the  same 
batch  computer  edits,  the  CATI  data  were  found 
to  contain  75  percent  fewer  critical  errors  than 
the  paper-and-pencil  data.  Critical  errors  were 
defined  as  those  requiring  another  contact  with 
the  sample  case. 

Placing a survey on CATI does not necessarily
improve data consistency across all items. The
effect will occur only where automatic branching
and online editing enhance consistency.
Improved consistency also may be difficult to
detect if comparisons are made only after the
data are clerically edited and imputations made
for missing data and out-of-range entries.
Ferrari  (1986)  reports  only  a  trivially  lower 
(although  statistically  significant)  overall 
edit  failure  rate  for  CATI  than  for  non-CATI 
telephone  follow-up  data  in  the  Census  of 
Agriculture  when  submitted  to  the  same  batch 
computer  edits  after  the  close  of  field  work. 
Moreover,  the  CATI  edit  failure  rate  was 
significantly  higher  for  some  key  items  not 
included  in  the  CATI  online  edits.  For  maximum 
gains  in  data  consistency,  online  editing  must 
be  extensively  employed  and  parallel  the  key 
requirements of the batch editing programs.

5.4  Recording  of  Textual  Material 

Computer-assisted  telephone  interviewing  is 
generally  viewed  as  most  effective  in  obtaining 
numeric  and  precoded  responses.  Concerns  are 
more  frequently  expressed  about  the  quality  of 
textual  materials  obtained,  such  as  entries  to 
open-ended  questions  and  interviewer  notes 
qualifying  or  explaining  respondents'  answers. 
Most  CATI  systems  permit  entry  of  extended 
answers  and  qualifying  notes  to  any  question, 
but  their  entry  may  be  more  awkward  in  CATI  than 
with paper methods. CATI interviewers often are
required to have minimal typing skills, at least
20 words per minute, and are trained to inform
respondents that they are using a keyboard when
unable to keep up with the respondent's answers.
Nevertheless,  the  slower  rate  of  textual  entry 
in  CATI  may  reduce  the  completeness  of  answers 
to  open-ended  questions  and  discourage  entry  of 
qualifying  notes.  Separate  actions  required  to 
access  the  notes  function  also  may  discourage 
its  use. 




Morton  and  House  (1983)  summarize  field  staff 
impressions  on  two  CATI  surveys  which  suggest 
that  the  recording  of  textual  material  is  not  a 
problem.  They  say:  "...  we  found  that  the  lack 
of  typing  speed  did  not  seem  to  be  an  irritant 
to  the  respondent,  and  the  speed  did  improve  as 
interviewers  felt  more  comfortable  with  the 
keyboard." No quantitative data have yet been
presented to compare the quality of CATI and
non-CATI responses to open-ended questions; but
Harlow et al. (1985) found that CATI interviewers
entered 25 percent fewer comments than
non-CATI  interviewers;  and  this  contributed  to  a 
lower  rate  of  unresolved  "don't  know"  responses 
after  comments  were  employed  in  clerical 
editing.  These  differences  were  not  statis¬ 
tically  significant  in  the  small  samples 
examined,  but  they  are  supported  by  similar 
observations  by  Tortora  (1985).  Further 
research  on  these  topics  is  clearly  needed. 

6. Summary

The remarkable growth of CATI in commercial
market research, in university research centers,
and in the planning of government agencies has
proceeded largely without firm research results
on its consequences for survey costs and data
quality. Detailed evidence about CATI's data
collection characteristics has begun to appear
only in recent years.

When compared with paper-and-pencil methods,
CATI typically entails higher costs in computing
hardware and software and perhaps in survey
design. Offsetting savings are most likely to
be realized in interviewer productivity and
post-interview processing. Because CATI
interviews typically run longer, interviewer
productivity may not be increased by CATI
systems without online call scheduling and case
assignment. But with these capabilities, major
increases in interviewer productivity seem
possible.

The  effects  of  CATI  on  data  quality  generally 
appear  to  be  small  or  negligible  except  when 
specific  data  quality  enhancement  features  are 
employed. CATI typically has no effect on the
response rates and refusal rates of telephone
interviews, but may increase contact rates in
some applications with efficient online call
scheduling and tracing routines. Similarly,
CATI typically has little or no effect on item
nonresponse except in applications where
automatic branching encourages interviewers to
ask questions they might otherwise omit. CATI
does increase data consistency but only where
its automatic branching and online editing
features are used. At the same time, CATI may
result in less complete entries to open-ended
questions and less frequent interviewer
comments.

CATI  remains  a  promising  technology  for 
survey  data  collection,  but  like  other  data 
collection  methods  can  be  expected  to  have  its 
own  strengths  and  weaknesses.  Further  research 
is  required  to  identify  these  more  fully,  both 
to  guide  appropriate  choices  of  method  for 
specific  surveys  and  to  stimulate  corrective 
measures in areas where weaknesses are found.


INFERENCE FROM COARSE DATA USING MULTIPLE IMPUTATION


Daniel  F.  Heitjan,  U.C.L.A. 
Donald  B.  Rubin,  Harvard  University 


Inference  from  Coarse  Data  Using  Multiple 
Imputation 


Multiple imputation is a procedure for handling
inadequate data by filling in several plausible
values for each inadequately reported value.
The  basic  ideas  underlying  multiple  imputation 
are  reviewed  and  then  applied  to  a  data  set  with 
coarsely  reported  ages  of  children.  Sensitivity 
analyses  and  diagnostic  displays  are  included. 


1. Multiple Imputation


Pervasiveness  of  inadequate  data 

Essentially all data collected in surveys are
inadequate in some aspects. Commonly, for example,
survey data suffer from nonresponse: some sampled
units do not provide answers to some questions.
Usually such nonresponse leads to missing
values, but a less extreme form, perhaps called
"partial nonresponse," leads to data much coarser
than desired. For instance, in many surveys income
questions suffer from either nonresponse, because
some individuals refuse to divulge their
incomes, or partial nonresponse, because only
coarse information, such as above or below $20,000,
is reported.


Imputation 

Imputation  is  the  process  of  filling  in  each 
missing  value  with  a  specific  value,  such  as  the 
respondents'  mean  for  that  variable  or  a  value 
predicted  from  variables  that  are  observed  for 
that unit. For example, if income is missing for
a group of individuals, it might be imputed using
predictions based on a regression of log income
on fully observed background characteristics using
those individuals who reported income. With
partial nonresponse, such as coarsely reported
income, a specific value for income consistent
with the coarsely reported value would be imputed.
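A small Python sketch may make the regression example concrete. It is an illustration under stated assumptions, not the procedure of any survey cited here: the single predictor `educ`, the record layout, and the rule of clipping a prediction into the reported bracket are all hypothetical choices made for this sketch.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def impute_income(records):
    """Fill each missing income with a single regression prediction.

    Each record is a dict with 'educ' (years of schooling), 'income'
    (a number, or None if not reported exactly) and, optionally,
    'bracket', a (low, high) pair for coarsely reported income.
    """
    observed = [r for r in records if r["income"] is not None]
    a, b = fit_line([r["educ"] for r in observed],
                    [math.log(r["income"]) for r in observed])
    completed = []
    for r in records:
        income = r["income"]
        if income is None:
            income = math.exp(a + b * r["educ"])   # prediction on the dollar scale
            low, high = r.get("bracket", (0.0, math.inf))
            income = min(max(income, low), high)   # keep it inside the reported bracket
        completed.append({**r, "income": income})
    return completed


# Hypothetical records: three respondents, one nonrespondent, and one
# person who reported only that income was at most $20,000.
records = [
    {"educ": 8, "income": 10000.0},
    {"educ": 12, "income": 20000.0},
    {"educ": 16, "income": 40000.0},
    {"educ": 12, "income": None},
    {"educ": 16, "income": None, "bracket": (0.0, 20000.0)},
]
completed = impute_income(records)
```

Each missing value receives exactly one filled-in number here, so the completed file looks fully observed and carries no trace of the uncertainty in the prediction.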


Advantages and disadvantages of imputation

The  practice  of  imputing  for  missing  values  is 
very  common  because  it  has  the  obvious  practical 
advantage  of  allowing  standard  complete-data  meth¬ 
ods  of  analysis  to  be  used.  This  advantage  is 
extremely  important  not  only  when  forming  infer¬ 
ences  but  also  when  conducting  diagnostic  analy¬ 
ses.  Imputation  also  has  an  advantage  in  many 
contexts  in  which  the  data  collector  (e.g.,  the 
Census  Bureau)  and  the  data  analyst  (e.g.,  a  uni¬ 
versity  social  scientist)  are  different  individ¬ 
uals,  because  the  data  collector  may  have  access 
to more and better information about nonrespondents
than the data analyst. For example, in some
cases, information protected by confidentiality
constraints, such as zip codes of dwelling units,
may be available to help impute missing values
such as annual incomes. The obvious disadvantage
of single imputation is that imputing a single
value treats that value as known. Consequently,
without special adjustments, single imputation
cannot reflect sampling variability about the
correct value to impute even supposing the reasons
for nonresponse are known, nor can single
imputation represent the additional uncertainty
that arises when the reasons for nonresponse are
not known.


Multiple  imputation 

Multiple imputation, in contrast to single imputation,
replaces each missing value not with a
single value but with a vector of M ≥ 2 imputed
values. The M values are ordered in the sense
that  M  completed-data  sets  are  created  from  the 
vectors  of  imputations:  replacing  each  missing 
value  by  the  first  component  in  its  vector  of 
imputations  creates  the  first  completed  data  set, 
and  so  on.  Standard  complete-data  methods  are 
used  to  analyze  each  data  set.  That  is,  standard 
complete-data  methods  of  inference  and  diagnosis 
are  used  on  each  completed  data  set.  When  the  M 
sets  of  imputations  are  repeated  random  draws  un¬ 
der  one  model  for  nonresponse,  with  each  set  cor¬ 
responding  to  an  independent  drawing  of  the  para¬ 
meters  and  missing  values  from  their  posterior 
predictive  distribution,  the  M  complete-data  in¬ 
ferences  can  be  combined  to  form  one  inference 
that  properly  reflects  uncertainty  due  to  nonre¬ 
sponse  under  that  model.  When  the  imputations 
are  drawn  from  two  or  more  models  for  nonre¬ 
sponse,  the  combined  inferences  under  the  models 
can  be  contrasted  across  models  to  display  sensi¬ 
tivity  of  inference  to  models  for  nonresponse,  a 
particularly  critical  activity  when  the  precise 
reasons  for  nonresponse  are  unknown.  Thus  multi¬ 
ple  imputation  has  the  advantages  of  single  impu¬ 
tation  but  rectifies  both  disadvantages.  The 
only  disadvantage  of  multiple  imputation  over 
single  imputation  is  that  it  takes  more  work  to 
create  the  imputations  and  analyze  the  results. 
The extra work in analyzing the data, however, is
really quite modest in today's computing environments
since it basically involves performing the
same task M times instead of once.
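As a rough illustration of how M completed data sets might be created, the Python sketch below multiply imputes a 0/1 variable. It rests on assumptions that are not taken from the paper: nonresponse is treated as ignorable, the success probability has a uniform prior, and the function name `multiply_impute` is invented for this example. Each pass draws the parameter from its posterior and then draws the missing values given that parameter, which is the two-step drawing described above.

```python
import random


def multiply_impute(values, M, rng):
    """Create M completed copies of a 0/1 variable whose missing entries are None.

    For each copy, a success probability is drawn from its Beta posterior
    (uniform prior assumed), then each missing entry is drawn as a
    Bernoulli variable with that probability.
    """
    observed = [v for v in values if v is not None]
    successes = sum(observed)
    failures = len(observed) - successes
    completed_sets = []
    for _ in range(M):
        p = rng.betavariate(successes + 1, failures + 1)   # draw the parameter
        completed = [v if v is not None else int(rng.random() < p)  # draw missing values
                     for v in values]
        completed_sets.append(completed)
    return completed_sets


rng = random.Random(1986)   # fixed seed so the sketch is repeatable
reported = [1, 0, None, 1, 1, None, 0, 1, None, 0]   # hypothetical responses
completed_sets = multiply_impute(reported, M=3, rng=rng)

# The same complete-data analysis is then run on every completed set;
# here that analysis is simply the proportion of successes.
estimates = [sum(c) / len(c) for c in completed_sets]
```

Observed values are carried over unchanged into every completed set; only the entries that were missing differ from one set to the next.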

Multiple  imputation  was  first  proposed  in 
Rubin  (1978).  A  comprehensive  treatment  is  given 
in  Rubin  (1986a);  other  easily  accessible  refer¬ 
ences  include  Rubin  (1986b),  Herzog  and  Rubin 
(1983),  and  Rubin  and  Schenker  (1986). 


Forming summary inferences from a multiply
imputed data set

Forming summary inferences from a multiply imputed
data set is quite direct. First, each data
set completed by imputation is analyzed using the
same complete-data method that would be used in
the absence of nonresponse. Let $\hat{\theta}_\ell$, $U_\ell$,
$\ell = 1, \ldots, M$, be M complete-data estimates and
their associated variances for an estimand $\theta$,
calculated from M repeated imputations under one
model. For instance, when estimating a proportion
$\theta$ from a simple random sample of size n, $\hat{\theta}_\ell$
is given by $p_\ell$, the proportion of successes
calculated using the $\ell$th set of imputed values for the
missing values, and $U_\ell$ is given by $p_\ell(1 - p_\ell)/n$,
at least for modestly large n and $p_\ell$ not too near
0 or 1. The combined estimate is

\[ \bar{\theta}_M = \frac{1}{M} \sum_{\ell=1}^{M} \hat{\theta}_\ell . \]

The variability associated with this estimate has
two components: the average within-imputation
variance,

\[ \bar{U}_M = \frac{1}{M} \sum_{\ell=1}^{M} U_\ell , \]

and the between-imputation component,

\[ B_M = \frac{1}{M-1} \sum_{\ell=1}^{M} (\hat{\theta}_\ell - \bar{\theta}_M)^2 \]

(with row vector $\theta$, the square is replaced by
$(\hat{\theta}_\ell - \bar{\theta}_M)^t (\hat{\theta}_\ell - \bar{\theta}_M)$).

The total variability associated with $(\bar{\theta}_M - \theta)$ is

\[ T_M = \bar{U}_M + \frac{M+1}{M} B_M , \]

where $(M+1)/M$ is an adjustment for small M. With
scalar $\theta$ and small M, the reference distribution
for interval estimates and significance tests is
a t distribution,

\[ (\theta - \bar{\theta}_M) \, T_M^{-1/2} \sim t_\nu , \]

where the degrees of freedom,

\[ \nu = (M-1) \left[ 1 + \frac{M}{M+1} \, \frac{\bar{U}_M}{B_M} \right]^2 , \]

is based on a Satterthwaite approximation (Rubin
and Schenker, 1986). When M is large, the inference
for $\theta$ is based on the normal approximation

\[ (\theta - \bar{\theta}_M) \, T_M^{-1/2} \sim N(0, 1). \]


For $\theta$ with r components, significance levels
for null values of $\theta$ can be obtained from M repeated
complete-data estimates, $\hat{\theta}_\ell$, and
variance-covariance matrices, $U_\ell$, using multivariate
analogues of the above expressions. Less precise
p-values can be obtained directly from M repeated
complete-data significance levels. Details may
be found in Rubin (1986a).
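For a scalar quantity, the combining rules of this section amount to a few lines of arithmetic. The Python sketch below is one way to write them; the function name `combine`, the example proportions, and the sample size are hypothetical, and the handling of the case where all M estimates coincide (infinite degrees of freedom) is a choice made for this sketch, not something stated in the text.

```python
import math
from statistics import mean, variance


def combine(estimates, variances):
    """Combine M >= 2 scalar complete-data estimates and their variances."""
    M = len(estimates)
    theta_bar = mean(estimates)          # combined estimate
    U_bar = mean(variances)              # average within-imputation variance
    B = variance(estimates)              # between-imputation variance, divisor M - 1
    T = U_bar + (M + 1) / M * B          # total variance
    if B > 0:
        nu = (M - 1) * (1 + M / (M + 1) * U_bar / B) ** 2
    else:
        nu = math.inf                    # identical estimates: no between-imputation spread
    return theta_bar, T, nu


n = 100                                  # hypothetical sample size
p = [0.40, 0.44, 0.42]                   # hypothetical proportions from M = 3 completed sets
U = [p_l * (1 - p_l) / n for p_l in p]   # complete-data variance of each proportion
theta_bar, T, nu = combine(p, U)
print(theta_bar, T, nu)
```

An interval estimate would then take the combined estimate plus or minus a t quantile on `nu` degrees of freedom times the square root of `T`.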

Although  multiple  imputation  is  most  directly 
motivated  from  the  Bayesian  perspective,  the  re¬ 
sultant  inference  can  be  shown  to  possess  good 
sampling  properties.  For  example,  Rubin  and 
Schenker  (1986)  show  that  in  many  cases  interval 
estimates  created  using  only  two  imputations  pro¬ 
vide  randomization-based  coverages  close  to  their 
nominal  levels. 

Missing  information 

The ratio $\bar{U}_M / B_M$ estimates the quantity
$(1 - \gamma)/\gamma$, where $\gamma$ is the fraction of information
about $\theta$ missing due to nonresponse. This fraction
is important in two ways. First, it affects
the adequacy of the distributional approximations
proposed above. Second, $\gamma$ governs the efficiency
of $\bar{\theta}_M$ as an estimator of $\theta$; specifically, the
variance of $\bar{\theta}_M$ is larger than the variance of
$\bar{\theta}_\infty$ by the factor

\[ 1 + \gamma / M . \]

Beyond one summary inference from a multiply-
imputed data set

Although  the  ability  to  form  one  summary  in¬ 
ference  when  the  multiple  imputations  are  re¬ 
peated  draws  from  the  posterior  predictive  dis¬ 
tributions  of  the  missing  values  is  important, 
equally  important  is  the  fact  that  the  creation 
of complete data sets allows (i) the use of standard
diagnostic techniques to help criticize posited
models and (ii) the assessment of sensitivity
of inference to various models. These points will
be  illustrated  here  using  a  particular  survey  of 
nutritional  status  of  children  in  Tanzania  that 
suffers  from  coarse  reporting  of  ages.  Far  more 
comprehensive  presentations  of  the  data  and  these 
analyses  are  given  in  Heitjan  (1985)  and  Heitjan 
and  Rubin  (1986). 


The  Tanzania  Nutrition  Data 


The  data  base 


The data set that we use to illustrate multiple imputation for
coarse data consists of anthropometric measurements on children
under six years of age from eight poor rural areas in Tanzania
taken by nutrition researchers interested in estimating the extent
of malnutrition in the various regions (Kimati, 1985).
Approximately five thousand children comprise the full data base;
we focus on the 270 children from the Dodoma region. In addition
to sex of child and age, as provided by the mother, weight, height,
mid-arm circumference and head circumference were recorded by the
researchers.


Objective  of  complete-data  analysis 

A simple way to measure the extent of malnutrition among such a
group of children is to calculate the percentage of them that are
classified as stunted or not and wasted or not. Stunted refers to
being short for age and wasted refers to being light for height,
where the definitions are established for boys and girls from data
collected on groups of known normal healthy children from the U.S.
(Hamill et al., 1979). Children who are classified as stunted but
not wasted may simply be short, and children who are classified as
wasted but not stunted may simply be thin, but simultaneous
stuntedness and wastedness in a single child are regarded as clear
evidence of malnourishment. Assuming accurate measurements of sex,
age, height and weight in the 270 children, inference for the
extent of malnourishment would be based on p, the proportion of
stunted and wasted children among the 270, and its standard error,
SE = [p(1 - p)/n]^(1/2), where n = 270. The data as reported give
p = 5.9% and SE = 1.4%.
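
As a quick check of the arithmetic above, the following minimal Python sketch evaluates the binomial standard error for the reported proportion. The function name `proportion_se` is ours, introduced only for illustration.

```python
import math


def proportion_se(p: float, n: int) -> float:
    """Binomial standard error of a sample proportion: [p(1 - p)/n]^(1/2)."""
    return math.sqrt(p * (1.0 - p) / n)


# Values quoted in the text for the 270 Dodoma children.
p_reported = 0.059
n_children = 270
se_reported = proportion_se(p_reported, n_children)

# Roughly 0.014, i.e. the 1.4% quoted in the text.
print(f"p = {p_reported:.3f}, SE = {se_reported:.4f}")
```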


Age-heaping 

The problem with this simple answer is that, even though height
and weight are accurate measurements, age as reported by mothers
is quite coarse. For ages over a year, most ages in months are
reported as divisible by 6; this phenomenon is common and is known
as age heaping (e.g., see Ewbank, 1981). The problem is possibly
more serious in Tanzania than in the United States in the sense
that precise date of birth is not very important in Tanzania, and
so mothers may not even know their children's ages to the nearest
month. Also, some evidence suggests that, as opposed to the
situation in the United States where reported ages are typically
truncated, reported ages may often be rounded to the nearest year
or six months. Figure 1 displays the reported ages for the 270
children.

[FIGURE 1: Reported age histogram, Dodoma. Horizontal axis: reported age in months.]


Restrictions on true age imposed by reported age

We consider two versions of the possible intervals for true age
given reported age, where for both versions, all ages reported as
under 12 months, except six-month reporters, are correct to the
nearest month. With "medium" intervals, all ages that are reported
to be a full year (i.e., reported age in months ≡ 0 mod 12) are
considered to arise from possible true ages within ±6 months of
the reported age, and all ages that are reported to be a mid-year
(i.e., age in months ≡ 6 mod 12) are considered to arise from
possible true ages within ±3 months of the reported age. With
"wide" intervals, the bounds are twice as large, that is, ±12
months for full-year reporters, and ±6 months for mid-year
reporters. We label the width factor W = 0 (medium) and 1 (wide).
The status of the six-month reporters is determined by another
factor R = 0 (rounded) and 1 (exact). If R = 1, a reported age of
six months is treated as correct to the nearest month, whereas if
R = 0, a reported age of six is treated as rounded, within ±3
months if W = 0, and within ±6 months if W = 1. It is important to
realize that at this point, absolutely no assumption regarding the
distribution of age within these intervals is being made.
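
The interval rules just described can be written down compactly. The sketch below is only an illustration: the function name, the representation of an "exact" age as a zero-width interval, and the treatment of ages over 12 months that are not multiples of 6 as exact are our assumptions, not details given in the text.

```python
def true_age_interval(reported: int, w: int = 0, r: int = 0) -> tuple:
    """Return (low, high) bounds in months on true age given reported age.

    w: width factor W (0 = medium, 1 = wide).
    r: six-month factor R (0 = rounded, 1 = exact).
    """
    scale = 2 if w == 1 else 1          # wide bounds are twice the medium ones
    if reported == 6:
        half_width = 0 if r == 1 else 3 * scale
    elif reported < 12:
        half_width = 0                  # correct to the nearest month
    elif reported % 12 == 0:
        half_width = 6 * scale          # full-year reporters
    elif reported % 12 == 6:
        half_width = 3 * scale          # mid-year reporters
    else:
        half_width = 0                  # assumption: other ages taken as exact
    return (max(reported - half_width, 0), reported + half_width)


# A child reported as 36 months old under medium and wide intervals.
print(true_age_interval(36, w=0), true_age_interval(36, w=1))
```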

Extreme  tables  imputation 

Of course, it is perfectly possible with these definitions of
possible true ages that the reported age data are accurate enough
for purposes of inference about the proportion of malnourished
children, even if no further assumptions are made. Specifically,
since height, weight and sex are reported accurately, the wasted
classification is accurate and the height component of the stunted
classification is accurate. Perhaps no matter what the children's
true ages within these rounding intervals, inferences for the
proportion malnourished will be stable.

If all children are considered as young as possible given their
reported ages, the resulting proportion of the 270 that will be
classified as malnourished, p_opt, will be as small as possible;
the subscript opt is for optimistic. Similarly, if all children
are considered as old as possible given their reported ages, the
resulting proportion of the 270 that will be classified as
malnourished, p_pes, will be as large as possible; pes is for
pessimistic. In other words, if we impute true ages that are as
young as possible, we obtain the estimate p_opt with associated
standard error SE_opt = [p_opt(1 - p_opt)/n]^(1/2). Table 1 gives
the results, and indicates substantial sensitivity of answers,
especially considering that each 1% represents many hundreds of
children.


TABLE 1. Extreme values for proportion malnourished:
         estimate (standard error)

                                              Interval widths
                                              medium       wide
pessimistic ages (overstate malnutrition)     6.7% (1.5)   8.2% (1.7)
optimistic ages (understate malnutrition)     3.3% (1.1)   1.6% (0.7)
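
The extreme tables calculation can be sketched as follows. This is a toy illustration only: the classifier `toy_is_malnourished`, its height-for-age cutoff, and the four example children are hypothetical and are not the U.S. reference standards or the Tanzania data.

```python
def extreme_proportions(children, is_malnourished):
    """Return (p_opt, p_pes): proportions malnourished when every child is
    taken to be as young, or as old, as the age interval allows."""
    n = len(children)
    p_opt = sum(is_malnourished(c, c["low"]) for c in children) / n
    p_pes = sum(is_malnourished(c, c["high"]) for c in children) / n
    return p_opt, p_pes


def toy_is_malnourished(child, age_months):
    """Hypothetical rule: wasted (age-free) and below a made-up
    height-for-age cutoff that rises with age."""
    stunted = child["height_cm"] < 65 + 0.6 * age_months
    return child["wasted"] and stunted


toy_children = [
    {"low": 30, "high": 42, "height_cm": 85, "wasted": True},
    {"low": 15, "high": 21, "height_cm": 70, "wasted": True},
    {"low": 30, "high": 42, "height_cm": 80, "wasted": False},
    {"low": 10, "high": 10, "height_cm": 75, "wasted": True},
]
p_opt, p_pes = extreme_proportions(toy_children, toy_is_malnourished)
print(p_opt, p_pes)
```

Because only the stunted classification depends on age, and a given height looks shorter for an older child, the optimistic proportion can never exceed the pessimistic one under a rule of this kind.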

Plan  of  attack 

Since the extreme tables analysis suggests that the coarseness of
the reported ages does have an important practical effect on
inferences about the proportion malnourished, π, we proceed to
perform more sophisticated statistical analyses specifically
designed to take the coarseness into account. In particular, our
plan is to use multiple imputation to create a sequence of data
sets with various values for true ages from which the standard
complete-data inference for π can be calculated. These imputations
will be created using a variety of Bayesian models that relate
true age Y to reported age X and the other reported
characteristics Z = (sex, height, weight, mid-arm circumference
and head circumference).

The analyses of the imputed data sets within each model are
combined to form a valid inference under that model, and then
these inferences are contrasted across models to display
sensitivity of inference to modelling assumptions. Furthermore,
the data sets completed by imputation are used to help diagnose
the adequacy of the underlying model.


Models for True Age and Resulting Inferences and Diagnostics


A naive and obviously incorrect model

A naive first pass model corresponds to assuming that true ages Y
are uniformly distributed within the allowable intervals defined
by the value of reported age X and the levels of the factors W and
R. Imputed true ages are created by drawing from these uniform
distributions.

This model is obviously incorrect given the existence of
background variables Z because it implies, for example, that two
children with the same sex and reported age, of say 36 months,
have the same distribution of true ages despite the fact that one
is taller, heavier, and has larger head and mid-arm circumferences
than the other. Clearly, the bigger child is probably older than
the smaller child, and the imputed true ages should reflect this
fact.
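
The naive model is simple enough to sketch directly. In the illustration below the interval endpoints, the seed, and the choice of five imputations are arbitrary assumptions made for the example.

```python
import random


def impute_uniform_ages(intervals, rng):
    """Draw one imputed true age per child, uniformly within its interval."""
    return [rng.uniform(low, high) for (low, high) in intervals]


rng = random.Random(0)                        # fixed seed so the sketch is repeatable
toy_intervals = [(30, 42), (15, 21), (9, 9)]  # (low, high) bounds in months
imputed_sets = [impute_uniform_ages(toy_intervals, rng) for _ in range(5)]

for ages in imputed_sets:
    print([round(a, 1) for a in ages])
```

Note that the draws ignore Z entirely, which is exactly the defect the text points out: two children with the same interval get the same age distribution whatever their size.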

Notation 

Let g be the generally unobserved indicator for the degree of
rounding; g takes on three values indicating: reports to the
nearest month (0), reports to the nearest mid-year (1), reports to
the nearest year (2). This indicator is observed as 0 when
reported age is neither a mid-year nor a full year (X ≢ 0 mod 6),
but is either 0 or 1 when age is reported as a mid-year
(X ≡ 6 mod 12), since a child with reported mid-year age might be
that age to the nearest month and would have reported to the
nearest month no matter what his true age. Similarly, g is either
0, 1 or 2 when age is reported as a full year (X ≡ 0 mod 12). The
statistical problem is to model the joint conditional distribution
of g and Y given Z = (sex, weight, height, mid-arm circumference,
head circumference). Reported age, X, is a fixed function of Y and
g, and so its conditional distribution given (g, Y, Z) is fully
specified a priori.
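
The relationship between Y, g and X can be illustrated with a short sketch. We assume here that g = 1 means rounding to the nearest multiple of 6 months and g = 2 to the nearest multiple of 12, with ties rounded up; the text does not spell out the tie rule, and the function names are ours.

```python
import math

ROUNDING_UNIT = {0: 1, 1: 6, 2: 12}   # months; g -> rounding unit (assumed)


def reported_age(true_age: float, g: int) -> int:
    """Reported age X as a fixed function of true age Y and rounding type g."""
    unit = ROUNDING_UNIT[g]
    return int(math.floor(true_age / unit + 0.5)) * unit


def possible_g(x: int) -> set:
    """Values of g that are consistent with a reported age of x months."""
    if x % 12 == 0:
        return {0, 1, 2}     # full year: any degree of rounding is possible
    if x % 6 == 0:
        return {0, 1}        # mid-year: nearest month or nearest mid-year
    return {0}               # otherwise g is observed to be 0


print(reported_age(34.2, 0), reported_age(34.2, 1), reported_age(34.2, 2))
print(possible_g(36), possible_g(18), possible_g(17))
```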

The  regression  of  Y  on  Z 

The joint conditional distribution of (g, Y) given Z can be
defined by first specifying the conditional distribution of Y
given Z and then the conditional distribution of g given (Y, Z).

We assume a standard normal linear regression for Y^s given Z,
where the exponent or scale s is either 1 or 1/2. Thus the factor
S has two levels, 0 = raw scale, 1 = square root scale.

The  final  specification  for  g 

One class of models that we fit as a baseline assumes g is fixed
and known from the value of reported age X. In particular, if
reported age is a mid-year, then g is fixed at 1, and if reported
age is a full year, then g is fixed at 2; otherwise, g is 0. Thus,
all children with mid-year reported ages are regarded as always
being mid-year reporters, and all children with full-year reported
ages are regarded as always being full-year reporters. This model
is intuitively not very satisfying and in fact does not fit in
well with our stated objective to specify the conditional
distribution of g given (Y, Z), but it is relatively easy to fit.
The second class of models treats g as a random variable and
posits a proper joint conditional distribution of g given (Y, Z),
basically of a probit form: g is created by trichotomizing an
unobserved normal. Heitjan and Rubin (1986) provide details. The
factor indicating whether the intervals are fixed or random is I,
with levels 0 for "g is fixed from X" and 1 for "g is a random
variable."

Creation  of  imputations 

Under each of the 2^4 = 16 models being considered, posterior
distributions were estimated for the regression parameters under a
noninformative prior distribution. An EM algorithm (Dempster,
Laird and Rubin, 1977) was used in conjunction with Newton's
method to find the mode, and then the posterior distribution was
approximated as normal with mean equal to the mode and variance
provided by the second derivative at the mode.

A regression parameter was then drawn from this posterior
distribution, say β*, and conditional on β = β*, values of true
ages were independently imputed by drawing from the conditional
distribution of true age given X and Z for each of the 270
children. This process was effectively repeated independently
hundreds of times for each of the 2^4 models to create hundreds of
data sets with known nutritional status. Each data set was
summarized by the standard complete-data statistics p and
p(1 - p)/n, and these were combined within models according to the
methods of Section 1 to create 2^4 summary inferences about π.
Results are summarized in Table 2, which gives the estimate p, its
standard error SE, and the fraction of information missing about π
due to coarse rather than precise age data.
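
The combining step can be sketched with the standard multiple-imputation rules, which we assume are the "methods of Section 1" referred to above: the point estimate is the average of the M complete-data estimates, U_M is the average complete-data variance, B_M is the between-imputation variance, and the total variance is U_M + (1 + 1/M) B_M. The fraction of missing information is estimated here from the relation that U_M/B_M estimates (1 - γ)/γ. The toy inputs are invented.

```python
import math
from statistics import mean, variance


def combine_imputations(estimates, variances):
    """Combine M complete-data estimates and their variances.

    Returns (point estimate, standard error, estimated fraction of
    missing information)."""
    m = len(estimates)
    theta_bar = mean(estimates)            # average of the M estimates
    u_bar = mean(variances)                # within-imputation variance U_M
    b = variance(estimates)                # between-imputation variance B_M
    total = u_bar + (1.0 + 1.0 / m) * b    # total variance
    gamma = b / (u_bar + b)                # from U_M/B_M ~ (1 - gamma)/gamma
    return theta_bar, math.sqrt(total), gamma


# Toy example: three imputed data sets, each summarized by p and p(1 - p)/n.
toy_p = [0.05, 0.06, 0.07]
toy_var = [0.0002, 0.0002, 0.0002]
estimate, se, gamma_hat = combine_imputations(toy_p, toy_var)
print(round(estimate, 3), round(se, 4), round(gamma_hat, 3))
```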


TABLE 2: Sensitivity of inference for π across 16 models

Discussion of Table 2 - sensitivity analyses

Table 2 also gives the Yates' 2^4 ANOVA decomposition into effects
from the four factors in the design (Daniel, 1976). With respect
to the point estimate p, factors I, S and W have fairly large main
effects, but none is bigger than one-third of the associated
standard errors, which are basically unaffected by the various
models. The fractions of missing information are quite variable:
all main effects except R are nonnegligible, and the IW
interaction is also present. It is rather obvious that wider
intervals should lead to larger information loss, but it is
interesting to see that the variable intervals models also lead to
less information loss. Actually, after some thought, this is not
surprising since for each child the possible interval under the
random intervals model can be narrower and is never wider than
under the corresponding fixed intervals model. For all models,
however, the information losses are such that far fewer than 100
imputations under each model (say ten) would have provided
essentially the same inference as an infinite number of
imputations.
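
For readers unfamiliar with the Yates decomposition, the following sketch implements Yates' algorithm for a 2^k factorial with the responses listed in standard order. The four-run input is invented for illustration; the Table 2 values themselves are not reproduced here.

```python
def yates_effects(responses):
    """Yates' algorithm for a 2^k factorial, responses in standard order
    ((1), a, b, ab, c, ac, ...). Returns (grand mean, list of effects)
    with effects in the order A, B, AB, C, AC, BC, ABC, ..."""
    n = len(responses)
    k = n.bit_length() - 1
    if n < 2 or 2 ** k != n:
        raise ValueError("number of responses must be a power of two")
    column = list(responses)
    for _ in range(k):
        # Each pass: sums of adjacent pairs, then differences (second - first).
        sums = [column[i] + column[i + 1] for i in range(0, n, 2)]
        diffs = [column[i + 1] - column[i] for i in range(0, n, 2)]
        column = sums + diffs
    grand_mean = column[0] / n
    effects = [contrast / (n // 2) for contrast in column[1:]]
    return grand_mean, effects


# Invented 2^2 example: runs (1), a, b, ab.
grand_mean, effects = yates_effects([1, 3, 5, 7])
print(grand_mean, effects)
```

With 16 responses, one per model, the same function returns the four main effects and all interactions of the factors in the design.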

Diagnostic checks

One of the benefits of multiple imputation is the ability to draw
valid inferences, such as summarized in Table 2, under a variety
of models using standard complete-data statistics. Another
advantage accruing from the creation of complete-data sets using
multiple imputation is the ability to use standard complete-data
diagnostic techniques. For instance, residual plots using imputed
data are displayed in Figure 2 for models
{I = fixed, S = raw, W = medium, R = rounded} and
{I = fixed, S = square root, W = wide, R = rounded}. Such displays
consistently support the conclusion that the square root scale is
superior to the raw scale. Also, average (across imputations
within a model) histograms of five imputed true ages were produced
in Figure 3 for models
{I = fixed, S = raw, W = medium, R = rounded} and
{I = variable, S = square root, W = medium, R = rounded}. Such
displays consistently support the random-interval, medium-width
models since they did not have the objectionable underheaping at
full-years and overheaping at mid-years present in the fixed and
wide interval models.


[Figure panel title: Model = {I = 0, S = 1, W = 1, R = 1}; plot not reproduced.]

Summary

In summary, the procedure of producing multiple imputations under
a variety of models generated the following conclusions.

1. The square-root scale model with medium, variable intervals
   was preferred on the basis of diagnostic displays.

2. The variable intervals models led to less information loss.

3. The point estimate of fraction malnourished was relatively
   insensitive to reasonable model specifications when considered
   as a fraction of its standard error.

For further research purposes, a multiply imputed data set is
available with five repeated draws of true ages from the preferred
model {I = variable, S = square root, W = medium, R = rounded}.
From Table 2, the information loss for π with this model is
approximately 8.6%, which means that estimates based on the five
imputed values have 1.7% more variance than those based on an
infinite number of imputed values, and the number of degrees of
freedom in reference distributions is unaffected by the finiteness
of the number of imputations.
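
The 1.7% figure is consistent with the relative-variance factor 1 + γ/M for M imputations, which is the standard multiple-imputation result and which we assume is the factor intended here. A minimal check, with a function name of our own choosing:

```python
def relative_variance(gamma: float, m: int) -> float:
    """Variance of the M-imputation estimate relative to the
    infinite-imputation estimate, assuming the factor 1 + gamma/M."""
    return 1.0 + gamma / m


extra_five = relative_variance(0.086, 5) - 1.0   # five imputations
extra_ten = relative_variance(0.086, 10) - 1.0   # ten imputations
print(f"{extra_five:.4f} {extra_ten:.4f}")
```

With an information loss of 8.6%, the extra variance is about 0.017 for five imputations and about 0.009 for ten, which illustrates why a handful of imputations suffices in this example.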

1.  ABSTRACT

We  have  been  evaluating  our  prototype 
data  analysis  management  system,  which 
was  designed  to  aid  the  analyst  in 
keeping  track  of  the  course  of  a  data 
analysis.  This  paper  describes  some  of 
our  experiences  using  the  prototype  and 
summarizes  our  evaluation.  Evaluated 
features  include  capabilities  to 
graphically  depict  the  course  of  the 
analysis,  the  ability  to  return  to 
previous  milestones  in  the  analysis,  the 
ability  to  use  segments  of  the  log  that 
describe  the  course  of  the  analysis,  and 
the  ability  to  associate  both  written  and 
spoken  documentation  with  milestones  of 
the  analysis. 

2.  INTRODUCTION 

The  Analysis  of  Large  Data  Sets  (ALDS) 
Project  of  the  Pacific  Northwest 
Laboratory,  operated  by  Battelle  Memorial 
Institute,  has  implemented  a  prototype 
data  analysis  management  system  named 
ADAM  [3-6].  ADAM  is  currently  running  on 
a  DEC  VAX  11/780  using  AT&T  Bell 
Laboratories'  S  statistical  analysis 
system  [1].  ADAM  is  implemented  as  a 
function within S, which means that the
analyst  can  invoke  ADAM  from  S  at  any 
time  during  the  course  of  the  analysis. 

ADAM  was  designed  and  implemented  to 
evaluate  how  software  could  be  used  to 
help  the  analyst  track  the  course  of  the 
analysis.  A  team  of  statisticians  and 
computer  scientists  considered  the  data 
analysis  process  and  the  way  data 
analysts  interact  with  the  available 
tools.  While  the  quantity  and  quality  of 
tools  to  support  data  analysis  are 
improving,  there  has  been  very  little  to 
aid  the  analyst  in  keeping  track  of  what 
was  really  going  on  during  the  analysis. 
ADAM  was  designed  to  address  the 
shortcomings  of  existing  software  in  the 
area  of  data  analysis  management. 

3.  THE  ADAM  ENVIRONMENT 

ADAM runs in a somewhat unusual
environment on our DEC VAX 11/780. The S
package  is  normally  run  using  the  UNIX 
operating  system,  but  since  we  had  such  a 
large  investment  in  software  developed 
under  the  native  VAX  operating  system, 
VMS,  we  did  not  convert  to  UNIX  but  are 
running  EUNICE,  a  UNIX  derivative  that 
allows  UNIX  software  to  run  on  a  VAX 
using  VMS. 

As  will  be  described  in  more  detail 
below,  ADAM  is  graphics-oriented.  We 
wanted  to  take  advantage  of  an  extensive 


graphics  library  developed  in-house. 
Rather  than  converting  the  library  to  run 
in  the  EUNICE  environment,  our  computer 
scientists  developed  a  strategy  that 
allows  an  S  process  running  under  EUNICE 
to  create  a  VMS  subprocess  that  handles 
the  graphics  portions  of  ADAM.  Borrowing 
a  phrase  from  science  fiction,  the 
computer  scientists  refer  to  this 
technique  as  going  through  a  "wormhole," 
a  wormhole  being  a  way  of  moving  between 
parallel  universes. 

A  standard  feature  of  S  is  the  diary 
function.  While  using  S,  the  analyst  may 
turn  on  the  diary  so  that  all  commands 
given  to  S  will  be  recorded  in  a  file. 
This  file  forms  a  temporal  record  of  the 
course  of  the  analysis.  We  have  modified 
the  S  diary  so  that  it  records  additional 
information for ADAM's use.

Use  of  ADAM  is  optional.  When  the 
user  invokes  S,  start-up  procedures  will 
ask  whether  data  analysis  management  is 
to  be  used.  If  the  user  indicates  that 
it  will  be  used,  the  modified  diary  is 
opened  and  date  and  time  stamps  are 
inserted.  When  the  user  wants  to  access 
ADAM  to  perform  data  analysis  management 
functions,  ADAM  is  invoked  by  entering 

?adam 

4.  DEPICTING  THE  ANALYSIS 

When  we  examined  the  data  analysis 
process,  we  noted  that  analysts  could 
identify  significant  milestones  in  the 
analysis. These milestones could be
points  at  which  some  significant 
discovery  was  made,  points  marking  the 
completion  of  a  phase  of  the  analysis,  a 
dead  end  from  which  no  more  analysis 
would  be  performed,  or  points  at  which 
there  were  several  alternative  paths  to 
be  investigated.  We  also  noted  that 
these  milestones  could  be  logically 
linked  together  to  form  a  tree.  Since 
analysts  often  wish  to  pursue  alternative 
paths  from  a  given  point  in  the  analysis, 
we  wanted  to  provide  a  facility  to  allow 
the  analyst  to  recreate  the  state  of  the 
analysis  at  that  point.  We  named  these 
points  "save-states".  ADAM  depicts  the 
course  of  the  analysis  as  a  tree  of  these 
save-states  (see  Figure  1).  The  save- 
states  are  depicted  as  labelled  boxes. 

The  lines  joining  the  boxes  represent  the 
analysis  steps  that  occurred  in  moving 
from  one  save-state  to  another. 

A  number  of  functions  can  be  performed 
on  save-states.  The  analyst  may  use  the 
CREATE  function  whenever  the  analyst 
decides  that  a  significant  point  has  been 
reached  in  the  analysis.  Once  an  analyst 
has created a save-state, the analysis
can continue from that point, or the
analyst can choose to RESTORE a previously
created save-state. In this way, an
analyst can change the course of an
analysis. When an analyst restores a
previous save-state, the environment will
be the same as when the save-state was
originally created. Only the data sets
that were active when that save-state was
created will be available to the analyst
when the save-state is restored.

In addition to depicting the
relationships between the save-states,
ADAM can also display more detail about a
selected save-state through the SCAN
command (see Figure 2). The analyst
selects the save-state to be scanned and
a window opens with an overview of the
save-state. On the actual ADAM display,
the selected save-state is a different
color from the other save-states. In
Figure 2, the selected save-state's box
is heavily outlined. The analyst can
open another window to see previously
edited comments about the analysis or can
listen to comments recorded on a cassette
tape deck that is operated under computer
control. The analyst can also scan the
log (called a diary in S) to see the
commands that led to the creation of the
save-state being scanned.

Certain information about a restored
save-state can be modified through the
MODIFY command. This information
includes the save-state's name; its
author; icons that indicate the existence
of plots (the eye icon), written comments
(the keyboard icon), and spoken comments
(the ear icon); an icon to indicate that
special insight was gained in this
portion of the analysis (the lightbulb
icon); the save-state's data sets; and
documentation about the save-state.

While the course of an analysis is
primarily a tree, the process is not
completely tree-like. At any point in
the analysis, it may be useful to use a
data set that was not derived as a part
of the process that created the most
recent save-state. To address this
issue, we allow save-states to inherit
data from other save-states. To reduce
the clutter on the screen, these "data
from" paths are not normally displayed,
but the analyst can choose the SHOW
NETWORK option in order to see what data
has been associated with a save-state
from non-ancestor save-states. The ERASE
NETWORK option is used to remove the
"data from" arrows when the analyst is
finished viewing them.
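
The save-state structure described in this section can be sketched as a small data structure. This is an illustrative model in Python, not ADAM's implementation (ADAM is an S function): the class names, the `create` and `restore` methods, and the `data_from` list are hypothetical names chosen to mirror the CREATE and RESTORE functions and the "data from" paths.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(eq=False)
class SaveState:
    name: str
    parent: Optional["SaveState"] = None
    datasets: FrozenSet[str] = frozenset()       # data sets active at creation
    data_from: List["SaveState"] = field(default_factory=list)  # non-tree links
    children: List["SaveState"] = field(default_factory=list)


class Analysis:
    def __init__(self):
        self.root = SaveState("start")
        self.current = self.root
        self.active = set()                       # currently active data sets

    def create(self, name, data_from=()):
        """Record a milestone below the current save-state."""
        for source in data_from:                  # inherit from other save-states
            self.active |= source.datasets
        state = SaveState(name, self.current, frozenset(self.active),
                          list(data_from))
        self.current.children.append(state)
        self.current = state
        return state

    def restore(self, state):
        """Return to an earlier milestone; only its data sets are available."""
        self.current = state
        self.active = set(state.datasets)


analysis = Analysis()
analysis.active.add("raw")
cleaned = analysis.create("cleaned")
analysis.active.add("model_a")
fit_a = analysis.create("fit A")
analysis.restore(cleaned)                         # start an alternative path
analysis.active.add("model_b")
fit_b = analysis.create("fit B", data_from=[fit_a])
print(fit_b.parent.name, sorted(fit_b.datasets))
```

The parent links form the tree that ADAM draws, while `data_from` holds the extra links that SHOW NETWORK would display.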

The  menu  window  of  Figure  1  shows  four 
other  menu  options.  The  MOVE  WINDOW 
option  allows  the  user  to  reposition  the 
various  windows  that  ADAM  uses.  Every 
menu  has  a  HELP  option  that  explains  the 
various  menu  options.  Many  menus  have  an 
S-MODE  option  that  allows  the  user  to 
leave  ADAM  and  return  to  the  S 
statistical  package  for  further  analysis. 
When  the  S-MODE  option  is  used,  the 
display  of  save-states  is  erased.  The 
RETURN  option  is  used  to  erase  an 
existing  menu  and  return  to  a  higher- 
level  menu.  For  example,  the  RETURN 
option  on  the  SCAN  menu  of  Figure  2 
erases  the  large  SCAN  window  and  paints 
the  state  menu  of  Figure  1  on  the  screen. 


5.  USING THE LOG

We have modified the standard S diary
function and refer to it as the ADAM log.
In addition to recording the analysis
steps taken in S, ADAM also records data
analysis management steps. Time-stamped
information is recorded in the log every
time a new S session is started and when
a save-state is created or restored. S
treats the ADAM entries as comments, so
the analyst can reuse segments of the log
containing ADAM entries without those
entries having any effect.

The  ADAM  log  depicts  the  temporal 
sequence  of  the  analysis  while  the  ADAM 
display  depicts  its  logical  sequence. 

ADAM  entries  divide  the  log  into  segments 
that  describe  the  development  of  a  save- 
state  from  its  parent  save-state. 

There  are  a  number  of  functions  the 
analyst  can  perform  on  the  log  segments. 
The  contents  of  the  log  segments  can  be 
displayed.  In  addition,  because  of  the 
importance  of  graphics  in  analysis,  the 
analyst  can  choose  to  display  only  the 
plot  commands.  The  analyst  may  edit  log 
segments  to  remove  superfluous  entries  or 
errors.  The  analyst  may  also  edit  the 
log  to  create  procedures  (called  "macros" 
in  S)  that  can  be  invoked  at  a  later 
time. 
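
The log operations described above can be sketched as follows. The format of ADAM's log entries is not given in the text, so the `# ADAM` prefix, the entry wording, and the list of plotting command names below are all hypothetical; the sketch is in Python rather than S.

```python
ADAM_PREFIX = "# ADAM"            # hypothetical marker for ADAM entries
PLOT_COMMANDS = ("plot", "hist")  # hypothetical list of plotting commands


def split_log(lines):
    """Split a log into segments, each closed by the ADAM entry that follows
    it. Returns a list of (entry text or None, list of commands)."""
    segments, current = [], []
    for line in lines:
        if line.startswith(ADAM_PREFIX):
            segments.append((line[len(ADAM_PREFIX):].strip(), current))
            current = []
        else:
            current.append(line)
    if current:                    # commands issued after the last ADAM entry
        segments.append((None, current))
    return segments


def plot_commands(commands):
    """Keep only the commands that begin with a plotting function name."""
    return [c for c in commands if c.lstrip().startswith(PLOT_COMMANDS)]


toy_log = [
    "# ADAM session start",
    "x <- c(1, 2, 3)",
    "plot(x)",
    "# ADAM create cleaned",
    "y <- sqrt(x)",
    "hist(y)",
]
segments = split_log(toy_log)
for entry, commands in segments:
    print(entry, commands, plot_commands(commands))
```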

6.  DOCUMENTATION

One  of  the  strongest  points  of  ADAM  is 
its  documentation  capabilities.  There 
are  three  ways  that  an  analyst  using  ADAM 
can  document  the  course  of  the  analysis. 
(1)  The  analyst  can  insert  comments  into 
the  log  while  S  is  being  used.  This  is  a 
normal  S  function  and  can  be  used  to 
provide  a  running  commentary  on  the 
course  of  the  analysis.  (2)  The  analyst 
can  edit  an  optional  comment  file 
associated  with  a  save-state.  (3)  The 
analyst  can  record  spoken  comments 
associated  with  save-states.  The  tape 
deck  on  which  the  comments  are  recorded 
is  under  ADAM's  control  so  that  ADAM  can 
track  which  selections  on  the  tape  are 
associated  with  which  save-state.  This 
option  is  useful  for  analysts  who  feel 
comfortable  dictating. 

Analysts  can  use  all  three 
documentation  modes  as  desired. 

Inserting  comments  into  the  log  while 
using  S  is  valuable  for  recording  the 
sequence  of  events  during  the  analysis. 
The  comments  associated  with  the  save- 
state  are  intended  to  capture  information 
surrounding  the  creation  of  the  save- 
state  as  well  as  discoveries  and  insights 
leading  to  its  creation.  The  spoken 
comments  can  be  used  in  the  same  way  and 
could,  in  addition,  be  used  to  record 
discussions  between  analysts. 

The  documentation  is  useful  for  a 
number  of  purposes.  It  can  provide  a 
historical  perspective  on  the  course  of 
an  analysis.  This  is  of  particular 
significance  if  an  analysis  is  examined 
later  when  it  becomes  difficult  to 
remember  exactly  what  was  done.  The 
documentation  can  be  used  as  a  basis  for 


reconstructing  the  analysis.  If  an 
analysis  has  to  be  performed  again  or  the 
same  analysis  is  performed  on  a  different 
data  set,  having  the  commands  readily 
available  can  save  much  time.  The 
documentation  is  useful  for  quality 
assurance  purposes.  For  example,  dead 
ends  are  useful  in  demonstrating  that 
alternatives  were  examined.  The 
documentation  records  insights,  purposes, 
and  relevance  of  the  various  activities 
in  the  analysis.  In  addition,  the 
documentation  can  be  used  as  sample 
analyses  for  training  others. 

7.  ADVANTAGES  AND  DISADVANTAGES 

Our  evaluation  of  ADAM  has  pointed  up 
both  advantages  and  disadvantages  to  the 
current  implementation.  Many  of  the 
disadvantages  of  ADAM  can  be  rectified 
through  the  use  of  tools  that  were  not 
available  to  us  when  the  design  of  ADAM 
began.  ADAM  makes  extensive  use  of 
windowing  but  its  windowing  is  too  slow. 
Many  workstations  have  built-in  windowing 
software  that  would  not  only  provide 
greater  speed  but  would  also  provide  a 
built-in  mechanism  so  that  experienced 
users  could  skip  some  levels  of  menus. 

It  would  also  be  advantageous  to  have 
multiple  windows  active  at  the  same  time. 
With  the  current  implementation,  the 
analyst  is  either  in  the  ADAM  mode  doing 
data  analysis  management  or  is  in  S  doing 
analysis.  It  is  time-consuming  to  move 
from  one  mode  to  the  other.  In  a 
workstation  environment,  a  data  analysis 
management  system  could  run  in  one  window 
while  a  data  analysis  system  is  run  in 
another.

ADAM  does  not  actively  participate  in 
the  data  analysis.  Although  the  log  is 
recording  during  the  analysis,  there  is 
no  intervention  of  ADAM  in  the  process. 

As  a  result,  the  analyst  can  do  things 
that  will  cause  data  analysis  management 
to  fail.  For  example,  the  ability  to 
restore  a  save-state  is  based  on  the 
assumption  that  the  data  sets  still 
exist. In this prototype, the analyst
can  delete  a  data  set  that  ADAM  needs  for 
restoring  a  particular  state.  If  ADAM 
actively  participated  in  the  analysis, 
ADAM  could  inform  the  analyst  of  the 
impact  that  deleting  a  particular  data 
set  would  have.  If  the  analyst  wanted  to 
delete  the  data  set  anyway,  the  affected 
save-state  could  be  marked  as  non- 
restorable. 
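
The kind of active participation described above can be sketched in a few lines. This is only an illustration under our own assumptions, not ADAM's implementation: the names `SaveState`, `Manager`, `impact_of_deleting`, and `delete` are hypothetical, and a save-state is reduced to the set of data set names it needs.

```python
from dataclasses import dataclass, field


@dataclass
class SaveState:
    """A named snapshot that can be restored only if its data sets still exist."""
    name: str
    required: frozenset
    restorable: bool = True


@dataclass
class Manager:
    """Hypothetical manager that mediates data set deletion (not ADAM's design)."""
    datasets: set = field(default_factory=set)
    states: dict = field(default_factory=dict)

    def save(self, name, required):
        missing = set(required) - self.datasets
        if missing:
            raise ValueError(f"unknown data sets: {sorted(missing)}")
        self.states[name] = SaveState(name, frozenset(required))

    def impact_of_deleting(self, dataset):
        """Names of still-restorable save-states that need this data set."""
        return sorted(s.name for s in self.states.values()
                      if s.restorable and dataset in s.required)

    def delete(self, dataset, confirmed=False):
        """Delete a data set; hold back until confirmed if save-states depend on it."""
        affected = self.impact_of_deleting(dataset)
        if affected and not confirmed:
            return affected  # caller warns the analyst, then calls again
        for name in affected:
            self.states[name].restorable = False  # mark as non-restorable
        self.datasets.discard(dataset)
        return affected


m = Manager(datasets={"raw", "clean"})
m.save("s1", ["raw"])
m.save("s2", ["raw", "clean"])
warned = m.delete("clean")                 # nothing deleted yet, analyst is warned
removed = m.delete("clean", confirmed=True)
```

The first call to `delete` only reports which save-states would be affected; the data set is removed, and the affected save-states are marked non-restorable, only once the analyst confirms.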

Another  way  in  which  a  data  analysis 
management  system  should  actively 
participate  in  the  analysis  concerns  the 
cleaning  up  of  the  log.  In  ADAM,  the 
analyst  can  use  the  standard  text  editor 
to  make  changes  to  the  log.  It  is 
possible  that  an  analyst  might  remove 
necessary  information  from  the  log  in 
addition  to  removing  superfluous  entries. 
Rather than having the analyst clean up
the log, the data analysis management
system could do the cleanup in
consultation with the analyst. This
would prevent the analyst from
inadvertently removing necessary
information. Work on the auditing of
data analyses [2] demonstrates how logs
can be processed to determine the
evolution of data sets and serves as one
approach to cleaning up data analysis
logs.

ADAM allows analysts to create
save-states as desired but provides no
mechanism for imposing a superstructure
on the save-states. The analyst may wish
to group a set of save-states into a
higher-level structure and label it. We
would like to provide a capability so
that this grouping would normally be seen
on the tree. The analyst could choose to
zoom in on that structure and see the
save-states that make it up. Such a
facility is available in DINDE [7], a
prototype statistical system running on a
Xerox 1108 personal workstation. DINDE
graphically depicts the analysis but has
no concept of save-state.

Our current implementation works only
in conjunction with S. Often data
preparation is done outside S. It would
be useful to have ADAM track activities
done outside of S as well as activities
done in other data analysis packages.
However, we used S for the prototype
because S was designed to be extendable.
We had access to the S source code. S
has its own interface language that
allows users to add functions. It would
have been very difficult to implement
ADAM using a statistical package other
than S.

There are many advantages to the
current implementation of ADAM. Strong
points of ADAM include the ability to
depict the logical course of the analysis
in addition to the temporal course of the
analysis represented by the log, and the
ability to restore previously defined
save-states.

The current implementation is
non-obtrusive. It does not interfere with
the course of the analysis, but can be
easily invoked as needed. It makes no
attempt to guide, assist, or consult with
the analyst.

ADAM provides a familiar, comfortable
environment in which to work. Because it
is menu-driven, there are no extra
commands to remember. It uses the
standard text editor. The current
implementation runs on the Tektronix 4100
series of terminals so it can easily be
run from the analyst's office.
Convenience of access is important to our
analysts. Not as much use as anticipated
has been made of the tape deck because it
is not portable and is currently in an
inconvenient location. Although the
figures in this paper do not include
color, ADAM does make use of color.
On-line help is available for every menu.

8. CONCLUSIONS

Much  has  been  learned  in  designing  and 
implementing  the  prototype  data  analysis 
management  system.  It  has  validated  many 
of  the  concepts  we  saw  as  being  basic  to 
understanding  the  data  analysis  process. 
We  believe  that  the  concepts  of 
graphically  depicting  the  course  of  the 
analysis  through  save-states  and  the 
ability  to  restore  a  save-state  for 
further  analysis  are  powerful  and  useful. 
Further  work  will  continue  to  incorporate 
these  concepts.  Our  next  data  analysis 
management  system  will  be  based  on  a 
workstation.  This  will  be  done  to  take 
advantage  of  the  software  available  on 
the  workstation  and  to  take  advantage  of 
the  speed  at  which  graphics  can  be 
generated  on  the  workstation.  However, 
we  will  continue  to  use  the  VAX  as  the 
machine  on  which  the  analysis  is 
performed  since  statistical  analysis  and 
graphics tools already exist there.

ABSTRACT

Interactive statistical computer programs represent
one class of tools which have made it easier for
statisticians to carry out the computations associated with
data analysis. We discuss additional tools, both software
and hardware, which can be combined with interactive
statistical packages to make it easier for the
statistician to implement a personal strategy for analyzing
data. An integrated collection of tools for data analysis
is called a computing environment. We describe the
DAMSL computing environment which is built around
off-the-shelf hardware and software costing less than
$4,000. This environment is designed to alleviate many
of the managerial burdens which arise in analyzing data.

0. INTRODUCTION

Adventures in Tomorrowland. The fascination and
the value of the Interface meetings for statisticians is
that we discuss new methods of data analysis, new
computing methods, new hardware. We learn about new
programs that people are working on that employ new
metaphors for data analysis that arise from, or make
good use of, new computing technologies. We explore
new ideas for using computer hardware and for building
computer software that may in the future radically
transform the practice of data analysis. It is the glimpse
of the future that these meetings afford that makes them
so attractive for many of us. Let me note some examples
from this year’s Interface.

John Tukey (1986) writes of every statistician having
a background program running on his or her personal
workstation which will study and diagnose interesting
aspects of a data set. This program will do its work
during the statistician’s off hours: during lunch, at
night, during faculty meetings; in short, whenever the
statistician is not using it for other useful work. Paul
Tukey (1986) reports on his work developing
“cognostics,” algorithms and heuristics for an otherwise
unassisted computer program to use to select interesting
views of a multivariate data set worth further scrutiny
by the program’s owner (perhaps when he returns from
breakfast). Richard Becker and John Chambers (1986)
discuss the notion of “meta data analysis,” in which the
steps taken by a statistician during the course of
analyzing a data set themselves become the raw data for a
higher-level analysis, and tools for collecting such data
using the notion of an audit. Wayne Oldford and Steve
Peters (1986) outline their approach to building
statistically sophisticated software, that is, programs
which know something about the process of data analysis.
Paula Cowley and her colleagues (1986) present their
experience developing a system for managing and organizing
the data-analysis process.

What  these  approaches  have  in  common  is  that  they 
are all exciting, they all have potential to change the
way we think about and carry out data analysis, and
they  all  exist  only  in  prototype  systems  not  generally 
available  for  public  consumption.  What  they  also  have 
in  common  is  that  at  least  part  of  each  can,  and  should, 
be  put  into  use  today  by  practicing  statisticians.  What 
can  be  done  (and  how)  is  the  subject  of  the  remainder 
of  this  paper. 

Three  concerns. 

[1] Statisticians have acquired familiarity with certain
software tools, such as SAS, S, Glim, Minitab, and many
others. Most data analysts know one or two such packages
intimately and likely have at least passing acquaintance
with others. This constitutes a major investment
of time and energy spent. What can we do that builds
on this investment rather than making it obsolete? As
Crecine (1986) points out, users of computer systems
must be able to learn a transferable technology which
will not become obsolete or unavailable tomorrow.

[2] Statisticians have adopted strategies for doing data
analysis using computer packages. What tools can be
provided which make it easier or more natural to implement
those strategies, and which also make it possible
to think about and to reflect upon the strategies
themselves?

[3]  What  can  mere  mortals  (defined  to  be  those  with 
limited  pocketbooks,  limited  time  to  learn  new  systems, 
and  limited  access  to  prototype  systems)  do  today  to 
make  data  analysis  more  productive,  to  help  manage 
the  data-analysis  process,  and  to  start  thinking  about 
personal  statistical  strategies? 

The DAMSL system described below addresses these
concerns by integrating nonstatistical tools based on
off-the-shelf technology costing less than $3,000 at today’s
prices. The system is designed to be a computing
environment that is equally useful to someone using Glim
as to someone using Minitab, as the tools it incorporates
exist on top of existing software rather than being built
into a particular statistical program. As a consequence,
what the statistician knows about a particular statistical
package does not become obsolete; it becomes more
useful! Moreover, the tools we introduce are generally
transferable, so that exactly the same tools can be
used if one switches from Minitab to do a rough plot to
Glim to do a logistic regression in the middle of a
data-analysis session. What we do presuppose is that every
data analyst has, and will continue to have, access to
standard interactive statistical programs such as those
we have already mentioned.

1.  STATISTICAL  STRATEGY 

When an experienced data analyst sits down at the
keyboard to examine a data set, he or she employs general
strategies for learning what the data have to say.
These strategies include heuristics both for combining
specific techniques and methods of data analysis and
for using the chosen computing system. The strategies
may be conscious or not. Loosely speaking, they constitute
the data analyst’s “style.” Statistical strategy
has not been systematically studied until very recently,
yet it is extremely important, for several reasons. First,
in teaching statistical methods, what we seek to impart
is really a collection of fruitful approaches rather
than a catalog of formulas. Second, if we can bring the
strategies that we (individually) adopt to the conscious
level, it becomes possible to examine them, to identify
successful ones, and to refine them. Doing so makes it
possible to recognize more readily situations in which
particular approaches may not be fruitful. If statistical
strategies can be verbalized, they can then be discussed,
taught, and debated. More generally, improved
understanding of statistical strategy leads to better and
more productive data analyses. Third, if there is any
hope to construct expert systems that can assist in data
analysis, it is essential to come to a more complete
understanding of the nature of the data analysis process,
including statistical strategy. Many of the chapters in
Gale (1986) are devoted to the problem of collecting and
representing knowledge about the strategies adopted by
data analysts.

Reflection about “what we did” in a data analysis is
difficult to carry out, since after the fact, many
spur-of-the-moment decisions will have been forgotten, ideas
lost, and inter-relationships among pieces of the analysis
obscured. Reasons for following a given line of attack
(particularly unfruitful ones) may have evaporated: “it
seemed like a good idea at the time.” Specially designed
computing environments (Thisted, 1986) can help by
looking over one’s shoulder during the analysis, and by
making it possible to record information about both
the intentions which lead to each step being taken and
the deductions resulting from that step (Thisted, 1985).
These ideas lie behind the notions of auditing a data
analysis à la Becker and Chambers (1986). The QPE
system, in which data auditing is being developed, is a
prototype system. Some of the features of data auditing
can be immediately realized using DAMSL.

2. DATA ANALYSIS MANAGEMENT

Except for the most trivial data analyses, the analysis
process is a long and involved one, requiring the
statistician over several days to keep track of a plethora
of modified and transformed variables, subsets of the
data, results of intermediate analyses, loose ends to follow
up on later, tables and graphs, side computations
that must be done outside the main statistical package,
and the like. This has become particularly difficult to do
using the standard interactive computing systems with
which most statisticians are now familiar, for example,
using Minitab on a timesharing computer via a 24-line
video-display terminal. Much of the statistician’s mental
energy and organizational skills are diverted from
the data per se and redirected to these essential, yet
peripheral, matters.

Although courses in statistical methodology rarely
address the issue, the practice of data analysis involves
projects of moderate to high complexity, the management
of which is nontrivial. Researchers at Battelle
Northwest Laboratories have made considerable progress
in developing computer tools whose function is to assist
in the management aspects of data analysis (Carr,
Cowley, and Whiting, 1984; Cowley and Whiting, 1985;
Cowley, Carr, and Nicholson, 1986). The ADAM system
which they have developed is integrated with the
S statistical language. Although ADAM only exists in
prototype, some of the features which make ADAM
attractive enhance the productivity of data analysts and
are realized in DAMSL.

3. COMPUTING TOOLS TO FACILITATE DATA ANALYSIS MANAGEMENT AND STRATEGY

We now turn to some ideas for creating a computing
environment which realizes some of the more important
features of prototype systems such as QPE and ADAM,
and which can do so at low cost. The system we describe
can be obtained immediately, for less than $3,000.
(As an existence proof, we name names and list prices.)
The emphasis will be on providing tools for data analysis
management and for analysis and implementation
of statistical strategy.

Computer tools for assisting data analysis can be
divided into three groups. Intelligent software incorporates
knowledge about the process of data analysis.
DINDE, described in Oldford and Peters (1986), is an
example of a program which knows a moderate amount
about such things as collinearity and means for diagnosing
its effects in different situations. Smart programs
incorporate structural knowledge about particular
statistical software, but know nothing directly about how
the software could be used to perform a data analysis.
Examples in this category include QPE and ADAM,
each of which knows something of the structure of S,
but nothing of the structure of data analysis. The last
class consists of the dumb software, which knows nothing
of data analysis or of statistical programs. DAMSL
is dumb software. But it is available.

The advantages of the dumb approach are many. It
is cheap. The product is portable. It is
hardware-independent, in the sense that it does not depend upon
the particular type of machine on which you do your
interactive computing; it is as much at home on a UNIX
machine as on a DECSystem-20. In addition, it is
software-independent, in the sense that it can be used
with S, Minitab, Glim, SAS, SCSS, or any other
statistical software, in a fashion that is insensitive to
changes that may be introduced into any of these programs.

The idea is to take a set of ordinary programs and
machines which can be adapted to perform tasks involved
in developing statistical strategy or in managing data
analysis. This hardware and software is then integrated
into a coherent computing environment within which
ordinary data analysis can also be conducted. The
components of DAMSL are listed below. We shall discuss
the capabilities of the hardware components in section
4, the capabilities of the software components in
section 5, and the integration of these components with
each other and with statistical software in section 6.


The DAMSL Hardware and Software

4. THE SET OF HARDWARE TOOLS


FIGURE  1 


The hardware needs of a statistical computing
environment could be met by any of a number of personal
computers. The features which are important (which
the Macintosh, for instance, includes) are: high-resolution
display capable of both text and graphics processing,
adequate memory for multiple applications to run
simultaneously, printer interface, and a telecommunications
(modem) interface. These are included in the price
of the Macintosh; systems based on other personal
computers (such as the IBM PC) would generally have to
take account of the additional expense these items
entail. The Macintosh has the advantage that it is
particularly easy to learn to use, thus minimizing the amount
of time that must be invested to learn the system.

The printer selected is integrated with the computer.
It is a modern dot-matrix printer that has good
resolution and is capable of printing high-resolution graphics
as well as text without additional hardware. A second
disk drive is an essential component of the system. The
1200 baud modem is the means for communication with
the remote timesharing computer on which the statistical
software will be accessed. Such devices generally can
operate at either 1200 or at 300 baud interchangeably.

5. THE SET OF SOFTWARE TOOLS

As with the hardware, other software components
could serve our purposes to similar effect; the particular
combination discussed here works particularly well
with the Macintosh and with each other.

Versaterm. This program is a terminal-emulation
program which makes it possible to use the hardware
of the previous section as if it were a very smart
terminal. Versaterm can (simultaneously) emulate both a
standard video display terminal, the DEC VT100, and a
standard graphics terminal, the Tektronix 4010.
Moreover, the terminal emulator can run at 9600 baud using
dedicated  communications  lines.  Thus,  at  a  minimum, 
the  DAMSL  system  can  simply  act  as  the  terminal  on 
which  data  analysis  is  done.  But  Versaterm  has  many 
features  which  are  particularly  useful  in  data  analysis. 
Several  of  these  are  illustrated  in  Figure  1.  They  are 
selected  by  pointing  the  mouse  to  particular  choices  in 
a  menu,  which  temporarily  overlaps  the  main  display. 
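
To give a feel for what the line speeds mentioned so far (300, 1200, and 9600 baud) mean at the terminal, here is a rough calculation of the time needed to repaint one full text screen. It is a sketch under stated assumptions that are ours, not the paper's: the quoted baud figures are treated as bits per second, each character costs 10 bits on the line (start bit, 8 data bits, stop bit), and the screen is 24 lines of 80 columns.

```python
def seconds_per_screen(baud, rows=24, cols=80, bits_per_char=10):
    """Approximate time to transmit one full text screen over a serial line.

    Assumes asynchronous framing of bits_per_char bits per character
    and no protocol overhead or flow-control pauses.
    """
    chars_per_second = baud / bits_per_char
    return rows * cols / chars_per_second


for baud in (300, 1200, 9600):
    print(baud, "baud:", round(seconds_per_screen(baud), 1), "seconds per screen")
```

Under these assumptions a full screen takes about a minute at 300 baud, roughly a quarter of a minute at 1200 baud, and a couple of seconds at 9600 baud.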


First, one can save the transcript of the terminal
session to disk, either for later perusal or for archival
purposes. This transcript includes everything that
transpires on the screen, including both input and output.
Thus, one has the option of retaining everything that
one would otherwise have from a session conducted at a
(much slower) hard-copy terminal. What is more, this
stream-saving feature can be turned on or off at any
time by moving the mouse to the “Save Stream” item
on the menu, so that it can be used selectively.

A  second  related  feature  is  that  the  stream  (of  both 
input  and  output)  can  also  be  echoed  to  the  attached 
printer.  Thus,  if  the  data  analyst  knows  that  he  or 
she  will  want  particular  results  for  permanent  reference 
(or  even  frequent  reference  during  the  terminal  session), 
that work can be printed automatically as it is generated.
This feature, too, can be toggled at any time by
selecting  the  item  “Print  Stream”. 
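
The behavior of the two toggles can be modeled with a very small sketch. This is not how Versaterm is implemented; the class `Session` and its methods are hypothetical names, and in-memory text buffers stand in for the disk file and the printer.

```python
import io


class Session:
    """Toy terminal session with independent 'save stream' and 'print stream' toggles."""

    def __init__(self, disk, printer):
        self.disk = disk          # file-like object standing in for the disk transcript
        self.printer = printer    # file-like object standing in for the printer
        self.save_stream = False
        self.print_stream = False

    def toggle_save(self):
        self.save_stream = not self.save_stream

    def toggle_print(self):
        self.print_stream = not self.print_stream

    def emit(self, text):
        """Pass one line of input or output to whichever streams are switched on."""
        if self.save_stream:
            self.disk.write(text + "\n")
        if self.print_stream:
            self.printer.write(text + "\n")


disk, printer = io.StringIO(), io.StringIO()
s = Session(disk, printer)
s.toggle_save()
s.emit("first command")      # saved only
s.toggle_print()
s.emit("important output")   # saved and printed
s.toggle_save()
s.emit("second command")     # printed only
```

Because the toggles are independent, the disk transcript and the printed record can cover different, overlapping stretches of the same session.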

Versaterm (when configured as in DAMSL) also
retains eighteen previous VT100 screens, as well as the
current screen. The information on these screens really
should be thought of as a single screen of some 450 lines,
any 24 of which are visible at a single time. The
particular 24 lines are determined by the position of the
scroll bar on the right-hand side of the display. The
white box can be thought of as an elevator car whose
position in the elevator shaft corresponds to position in
the 450-line terminal memory. The mouse can be used
at any time to move the elevator box to any desired
position. It is also possible to scroll forward or backward
a  line  at  a  time.  This  feature  makes  it  very  easy  to  go 
back  to  a  recently  performed  part  of  the  analysis  whose 
importance  may  not  have  been  obvious  at  the  time. 
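
The arithmetic behind the "some 450 lines" is simply 19 screens of 24 lines, or 456 lines. The following sketch models such a terminal memory and the mapping from elevator position to visible lines. It is an illustration only: the class `Scrollback` and the convention that the elevator position is a fraction between 0 and 1 are our assumptions, not a description of Versaterm's internals.

```python
from collections import deque

ROWS = 24                    # lines visible at once
SCREENS = 19                 # eighteen retained screens plus the current one
CAPACITY = ROWS * SCREENS    # 456 lines, the "some 450" of the text


class Scrollback:
    def __init__(self, capacity=CAPACITY, rows=ROWS):
        self.lines = deque(maxlen=capacity)   # oldest lines fall off the top
        self.rows = rows

    def write(self, line):
        self.lines.append(line)

    def view(self, position):
        """Return the visible lines for an elevator position in [0.0, 1.0].

        0.0 shows the oldest retained lines, 1.0 the most recent ones.
        """
        if not 0.0 <= position <= 1.0:
            raise ValueError("position must be between 0 and 1")
        last_top = max(len(self.lines) - self.rows, 0)
        top = round(position * last_top)
        return list(self.lines)[top:top + self.rows]


buf = Scrollback()
for i in range(1000):
    buf.write(f"line {i}")
newest = buf.view(1.0)   # the current screen
oldest = buf.view(0.0)   # as far back as the memory reaches
```

Once more than 456 lines have gone by, the earliest output is no longer reachable by scrolling, which is one reason to combine scrolling with the saved transcript.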

In conjunction with the scrolling feature, it may turn
out to be useful to have a hard copy of a portion of the
analysis, say a plot of the data, or a listing of potential
outliers, or a table of summary statistics. Versaterm
makes it easy to go back to that portion of the output,
select any number of lines from it, and then to print the
selection on the printer. Such a selection could also be
saved on a disk file on the Macintosh.

In addition to the features listed above, Versaterm
has a separate graphics screen for Tektronix emulation,
from which it is possible to produce quite adequate hard
copies of graphics output. The Tektronix output can
also be saved to disk, for later processing.

Finally, Versaterm has several protocols for transferring
files between the Macintosh and other computers,
including Xmodem and Kermit protocols. (The text
for this paper, for instance, was written on the
Macintosh and transferred using Versaterm to a computer
with typesetting software.)

Microsoft Word. The second workhorse of the
DAMSL system is a word processing program, some
of whose features are illustrated in Figures 2 and 3.
Word is one of several similar programs available for
the Macintosh which allow use of multiple fonts,
resizable windows, and a screen display which corresponds
very closely to printed output. The features of Word
which make it the choice for DAMSL are its ability to
have up to four windows open simultaneously, in which
four separate documents can be processed, and the fact
that Word has keyboard equivalents for most options
which can be selected by the mouse.

As Figure 2 illustrates, a single document can be
viewed in two different places at once, by splitting a
window, so that comments can be typed in, say, the
lower window, while observing the contents of the
upper window. Figure 3 demonstrates that several
different windows, with quite different contents, can be in
use simultaneously. By clicking the mouse on any one
of the windows, that window becomes the active
window; by clicking twice on the top of the window next to
the title, the window automatically increases in size to
occupy the whole screen. Another double-click returns
the window to its smaller size and position.

It  is  possible,  with  a  few  mouse  movements,  to  move 
or  copy  any  portion  of  any  window  into  any  position 
in  any  other  window.  Additions  and  insertions  again 
require  only  a  mouse  click  to  initiate.  Changes  of  font — 
both  style  and  size — can  be  accomplished  either  with 
the  mouse  or  with  one  or  two  keystrokes. 

Switcher. Although most of the real work is done by
Versaterm or by Word, the ingredient that makes the
system work is the program Switcher. Switcher is a
program which makes it possible to run different programs
on the Macintosh simultaneously. In DAMSL, we run
a word-processing program (Word) in one area of the
Mac’s memory, and a terminal emulator (Versaterm) in
the other. Each program acts as if it had the Mac to
itself when it is the active program, and each has its own
set of windows in which computation is done. Switcher
makes it possible to move from one program to the other
in less than a second, by simply clicking the mouse once,
or by typing a single key (your choice). What is more, it
is possible to move contents of one program’s windows
into those of the other program. Thus, for instance, I
can copy a scatterplot created by Minitab in Versaterm
directly into the middle of a manuscript I am working
on using Word. Alternatively, I can extract a Minitab
command embedded in a Word window (such as a line
from the “glucose mtab commands” window in Figure 3)
and cause it to be executed by the mainframe program
being run using Versaterm.

Switcher  makes  two  things  possible:  to  change  focus 
almost  instantaneously  from  a  program  where  work  is 
carried  out  (Versaterm)  to  a  program  where  information 
about  the  work  is  recorded  (Word)  and  vice  versa,  and 
to  move  information  back  and  forth  between  these  two 
areas  of  focus. 

6.  DAMSL:  AN  INTEGRATED  SYSTEM 

DAMSL is a somewhat strained acronym standing
for Data-Analysis Management and Strategy Liberator.
The DAMSL system consists not only of its individual
hardware and software components, but the ways in
which they are integrated with one another and with
underlying statistical software such as Glim and Minitab
running on a remote computer. The integration is
accomplished through the environment-switching capabilities
of Switcher, the multiple text-window capabilities
of Word, and the graphics, marking, saving, and
file-transfer capabilities of Versaterm. Viewed in the large,
DAMSL can be thought of as providing up to six
distinct windows onto the data analysis process, within
each of which separate aspects of the overall task can
be accomplished. They make it possible to focus more
easily on just one aspect of the data analysis at a time,
yet moving from one aspect to another takes no more
than a second. When these tasks overlap
or interact, information can be transferred readily from
the context in which it is generated to another context
in which it may prove useful.

For managing the data-analysis process, and for
monitoring and evaluating statistical strategies, the key lies
in the ways these windows and the other features of the
constituent software are used. These are perhaps best
illustrated with reference to an example.

The example involves the reanalysis of a published
data set from Smith and Choi (1982) concerning glucose
metabolism in 26 healthy male volunteers. Each
was given a standard glucose challenge dosage, and the
levels of plasma glucose (mg/dl) were recorded one hour
and three hours after the challenge (X2 and X3). In
addition, each subject’s weight in pounds was also recorded
(X1). Smith and Choi used the data to illustrate a test
comparing two dependent regression lines; they
concluded that the regressions of X2 on X1 and of X3 on
X1 were different. The question leading to the reanalysis
was whether some simple model could be found
that satisfactorily represented the relationship of glucose
metabolism to weight. Although this is a relatively
simple  problem,  it  is  sufficient  to  illustrate  many  of  the 
ways  in  which  DAMSL  can  assist  in  the  data  analysis 
process.  Because  Minitab  82.1  is  widely  familiar,  I  shall 
use  it  as  the  statistical  package  for  my  data  analysis. 
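
For readers who prefer symbols, one way to write down the two regressions being compared is given below. The notation (the Greek letters and the subject index i) is introduced here for illustration and is not taken from Smith and Choi (1982); the text above does not state exactly which hypothesis their test addresses, so the display should be read only as a restatement of "the regressions of X2 on X1 and of X3 on X1."

```latex
\begin{aligned}
X_{2i} &= \alpha_2 + \beta_2 X_{1i} + \varepsilon_{2i}, \\
X_{3i} &= \alpha_3 + \beta_3 X_{1i} + \varepsilon_{3i}, \qquad i = 1, \dots, 26 .
\end{aligned}
```

The two lines are "dependent" in the sense that both responses are measured on the same subject, so the errors for subject i in the two equations need not be independent of each other. Saying that the regressions are different then means that the two lines do not share both intercept and slope.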

The first thing that we do is to insert two disks
containing the DAMSL system into the computer; the
Macintosh shows the contents of the two disks as in Figure 4.
One disk contains Switcher and Word (DAMSL Master),
the other contains Versaterm, a file called DAMSL,
and other files and documents related to the current
data analysis. Generally, the first disk is the same for
all applications, while each different project will have its
own incarnation of the second disk. To start DAMSL,
one clicks twice on the file named DAMSL. This
automatically starts Switcher, which in turn starts both
Word and Versaterm in areas of memory of prespecified
size (160K for Word and 256K for Versaterm). Once
Switcher has been launched, at the upper left corner of
the screen is a pair of arrows. Moving the mouse to
either arrowhead and clicking once causes Switcher to
move in the next application program. Clicking the arrow
now, for instance, immediately moves us to Word.

FIGURE 5 (screen image; the legible note text reads: “I omitted
about 30 lines here from my notes. I had neglected to tell
Minitab about my terminal (which effectively has an infinite
screen rather than a 24-line screen), causing it to interrupt
output in the middle of a histogram.”)

The  next  step,  then,  is  to  enter  Word  and  to  set  up 
three  windows.  The  first  I  call  “glucose  notes,”  the  sec¬ 
ond  “glucose  variables,”  and  the  third  “glucose  mtab 
commands.”  These  will  contain,  respectively,  a  running 
commentary  on  the  most  important  aspects  of  the  anal¬ 
ysis  in  progress;  a  workspace  in  which  I  can  keep  track 
of  the  contents  of  the  Minitab  worksheet  and  its  vari¬ 
ables,  matrices,  and  constants;  and  an  area  in  which  I 
can  keep  track  of  the  sequence  of  Minitab  commands 
that  I  have  used  in  the  analysis.  This  uses  three  of 
Word’s  four  available  windows.  I  reserve  the  fourth  for 
possible temporary use later in the analysis. The Word
side of DAMSL is now ready to go.

We  then  switch  to  Versaterm,  and  immediately  choose 
the  option  to  “Save  Stream”  under  the  File  menu.  This 
means  that  our  entire  session  with  the  remote  computer 
will  be  automatically  recorded  on  the  Macintosh  disk. 
We  can  then  examine  it  later,  or  print  it  out,  or  discard 
it,  or  even  edit  it.  Next,  we  dial  the  remote  computer 
(by  selecting  an  item  from  the  Phone  menu),  and  we 
log in as usual and launch Minitab.

The  preliminaries  come  first:  we  load  the  dataset  (al¬ 
ready  saved,  for  convenience,  in  a  Minitab  worksheet), 
and  ask  for  the  brief  information  stored  with  the  work¬ 
sheet.  This  information  I  select  using  the  mouse  and 
then I copy it into the Word window “glucose variables”,
where it can serve as a constant reminder, both of the
contents and the source of the data set. I then return
to Versaterm, this entire process having taken just a few
seconds. 

As  a  matter  of  course,  I  always  obtain  simple  de¬ 
scriptive  statistics  for  the  variables  in  the  data  set,  and 
histograms  as  well.  This  is  starting-point  information, 
and  may  well  be  referred  to  at  many  later  stages  of  the 
analysis.  I  select  this  information,  and  copy  it  into  the 
Word  window  “glucose  notes”,  and  the  result  is  shown 
in Figure 5. Actually, as the figure makes clear, about
thirty lines of the terminal session were not transferred
to “glucose notes”. I had made a trivial mistake in
Minitab which I had no reason to perpetuate. The window
“glucose notes” is to contain only those parts of
the analysis which are important or useful in my chain
of reasoning about the data; I edit out that which is
useless, and I do so “on the fly.” When I review the
analysis (after having taken a lunch break, for instance)
I examine “glucose notes” rather than leafing through
the last 400 lines of computer output.

Figure 6

Already,  I  have  begun  to  organize  and  to  assist  my 
own  thinking  about  the  problem.  In  the  sequel,  I  do 
so  in  a  moderately  systematic  way.  Note  that  Figure  5 
contains material in two different fonts: a monospace
font (Monaco) in which Minitab input and output is
displayed and a boldface font (New York) in which I
have inserted commentary about the computing process.
I use an additional lightface font (Geneva, seen in
Figure 6) for my plans and for my own analysis. These
fonts visually represent different aspects of the
computations I undertake, and I use them systematically.

Before  going  to  Minitab  to  do  a  sequence  of  com¬ 
putations,  I  jot  down  a  few  notes  in  “glucose  notes” 
concerning  what  I  am  about  to  do,  and  why,  using  the 
Geneva  font.  I  then  follow  this  idea  in  Minitab  until  I 
have  done  what  I  set  out  to  do,  or  until  it  seems  as  if 
my  plan  is  changing  somewhat.  At  that  point,  I  pause, 
and  I  copy  relevant  portions  of  the  Minitab  output  into 
“glucose  notes”,  using  the  Monaco  font.  I  then  record 
in  a  few  words  my  interpretation  of  what  I  have  seen, 
and  how  that  has  changed  my  view  of  what  should  be 
examined  next  (using  Geneva  again).  It  is  also  a  trivial 
matter  to  annotate  the  Minitab  output  that  has  been 
copied  to  “glucose  notes”,  and  it  is  often  useful  to  do 
so.  If  a  regression  coefficient  in  one  analysis  is  close  to 
its  theoretical  value,  I  can  note  that  fact  “on  the  out¬ 
put,”  as  it  were,  and  I  can  do  so  in  real  time.  I  then 
write  down  a  few  words  about  what  I  am  about  to  do 
next,  and  then  I  repeat  the  cycle.  This  is  illustrated  in 
Figure  6,  in  which  an  interpretation  of  a  plot  is  coupled 
with  its  implications  for  what  to  do  next. 

In copying the Minitab output to “glucose notes”, I
copy only those portions which are important in shaping
my understanding of the problem or of the course
of the analysis. Other aspects are merely summarized;
for such summaries, I use boldface type. This is
illustrated in Figure 7. Note that the boldface entries record
something I looked at, but which did not contribute to
the unfolding story. Using this system, dead ends don’t
simply disappear; they actually contribute to the overall
understanding of the problem. At the same time,
the landscape is not cluttered with fragments of
computation that must be waded through in order to
reconstruct an analysis. When used in the manner outlined
here, “glucose notes” represents an implementation of
the “three-ring binder” ideas of Thisted (1986).

If  at  any  time  it  becomes  clear  that  there  are  several 
lines  of  attack  to  examine,  I  write  them  all  down  in  a 
list  as  part  of  “glucose  notes”,  and  then  I  copy  the  list 
to  a  new  (fourth)  window,  in  which  I  keep  temporary 
notes — in  this  case,  a  list  of  things  “to  do.”  As  I  do 
each  one,  I  can  check  it  off  in  this  fourth  window. 

Occasionally,  something  passed  over  as  insignificant 
(and  hence  not  copied  to  “glucose  notes”)  will  later  turn 
out  to  be  relevant.  The  terminal  memory  is  usually 
sufficient  to  cope  with  the  problem;  anything  done  fewer 
than  450  lines  ago  can  be  retrieved  in  a  few  seconds  and 
pasted  into  its  proper  place  in  “glucose  notes”,  along 
with  the  corresponding  notes  about  its  importance  and 
how  it  came  to  be  realized.  If  the  450-line  boundary  has 
passed,  however,  there  is  still  the  complete  transcript 
of  the  terminal  session  that  has  been  silently  recorded 
throughout  which  can  be  used  to  retrieve  anything  of 
interest  that  transpired  during  the  session. 
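
The two-level retrieval described above (a bounded terminal memory backed by a complete session transcript) can be sketched as follows. This is only an illustration, not Versaterm's actual mechanism: the class name SessionRecorder and its methods are hypothetical, and the 450-line limit is taken from the text.

```python
from collections import deque


class SessionRecorder:
    """Keep a bounded scrollback plus a complete transcript of a session."""

    def __init__(self, scrollback_lines=450):
        # Bounded buffer: the oldest lines fall off automatically.
        self.scrollback = deque(maxlen=scrollback_lines)
        # Unbounded log, standing in for the "Save Stream" file on disk.
        self.transcript = []

    def record(self, line):
        self.scrollback.append(line)
        self.transcript.append(line)

    def retrieve(self, first, last):
        """Return (source, lines) for lines first..last, 0-based inclusive."""
        oldest_in_scrollback = len(self.transcript) - len(self.scrollback)
        source = "scrollback" if first >= oldest_in_scrollback else "transcript"
        return source, self.transcript[first:last + 1]


recorder = SessionRecorder(scrollback_lines=450)
for i in range(1000):
    recorder.record(f"output line {i}")
```

Recent material is served from the scrollback; anything older is still recoverable, but only from the full transcript.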

One  of  the  three  windows  that  we  originally  set  up 
has  not  yet  been  mentioned.  It  is  sometimes  useful  to 
record  the  sequence  of  Minitab  commands  used  in  an 
analysis.  These  can  easily  be  extracted  by  copying,  say, 
several hundred lines at a time from Versaterm memory
into “glucose mtab commands”, and then, in a single
pass, removing the Minitab output. Remarkably,
Word makes it possible to do this with little effort, even
though it is done “manually.” What is more generally
useful is to copy the contents of “glucose notes” (which
contains Minitab input as well as output and comments)
into “glucose mtab commands”, and then to strip out
the non-commands. This was done, for instance, in the
front-most window of Figure 3. If a sequence of
commands, once done, appears to be particularly useful, it
can be copied into “glucose mtab commands”. There,
the Minitab prompts can be removed, the commands
edited, and then the whole sequence can be copied into
a Minitab EXEC file using the Minitab STORE
command on the Versaterm side.
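
The manual step of stripping out the non-commands can be pictured as a simple filter over the transcript. The sketch below is not part of DAMSL; it assumes that command lines begin with the prompts "MTB >" or "SUBC>", and the sample session text is invented purely for illustration.

```python
def extract_commands(transcript, prompts=("MTB >", "SUBC>")):
    """Return the command text from every line that begins with a prompt."""
    commands = []
    for line in transcript.splitlines():
        for prompt in prompts:
            if line.startswith(prompt):
                command = line[len(prompt):].strip()
                if command:  # skip bare prompts with nothing typed
                    commands.append(command)
                break
    return commands


# Hypothetical session fragment; the output lines are placeholders.
session = """MTB > RETRIEVE 'GLUCOSE'
MTB > DESCRIBE C1-C3
(output from DESCRIBE would appear here)
MTB > PLOT C2 C1
(plot output would appear here)
"""

commands = extract_commands(session)
```

The resulting list contains only what was typed, with prompts removed, which is the form needed before the sequence is edited and stored.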

We  have  mentioned  in  passing  “glucose  variables”. 
In  this  window,  we  keep  track  of  transformed  variables, 
subsets  of  the  data,  and  the  contents  of  Minitab  work¬ 
sheets  that  we  create  and  save  during  the  course  of  the 
analysis. It is here that we can emulate some of the
features of “save states” discussed in Cowley and
Whiting (1985), by recording the contents of various saved
worksheets (in effect, Minitab snapshots) and their
relationship to one another.

7.  DISCUSSION 

The  value  of  DAMSL  is  that  it  makes  it  possible  to 
isolate  different  aspects  of  the  process  of  data  analy¬ 
sis  in  separate  “windows.”  As  a  result,  the  data  ana¬ 
lyst  need  not  keep  so  many  things  in  view  all  at  once. 
Moreover,  it  automates  (or  at  least  simplifies)  many  of 
the  bookkeeping  and  other  managerial  tasks  associated 
with data analysis. Because it is not fully automatic (it
is a dumb assistant), using it effectively depends upon
the data analyst adopting a certain (mildly) disciplined
approach to data analysis. As with the discipline
introduced when using structured programming techniques
to write computer programs, this discipline serves to
organize the analysis both mentally and computationally.

Some  of  the  benefit  of  data  analysis  auditing  (see 
Becker  and  Chambers,  1986)  can  be  achieved  using  the 
“glucose mtab commands” window. At the very least,
this makes it possible to obtain a diary for any
interactive statistical package, which could then be used as
meta-data  for  further  study  of  the  data  analysis  process. 

Many statistical packages have diary or journal features
which, coupled with the ability to insert comment
commands, could serve some of the purposes to which
DAMSL has been put. What distinguishes DAMSL
from  what  is  available  in  S,  say,  is  twofold.  First,  the 
process  of  annotation  is  carried  out  in  a  different  “place” 
than  is  the  analysis  proper.  This  makes  it  easier  men¬ 
tally  to  focus  on  the  separate  activities  of  analysis,  plan¬ 
ning,  and  interpretation  on  the  one  hand,  and  the  actual 
computation on the other. When one is typing comments
into S, one is always aware that comments are
being inserted into S: omit a single # mark and try to
recover your train of thought! Second, in DAMSL both
the input and the output are available for annotation
and  editing.  Thus,  one  can  “mark”  on  the  output,  or 
eliminate  output  that  has  been  looked  at  but  found  not 
to  be  of  further  interest. 

The contents of the three-ring binder (“glucose notes”
in our example), particularly when compared against
the  entire  session  transcript,  can  be  a  powerful  tool  for 
reflecting  on  the  strategies  adopted  in  the  analysis  and 
on  their  effectiveness. 

STATISTICALLY SOPHISTICATED SOFTWARE AND DINDE
R.W.  Oldford  and  S.C.  Peters,  Massachusetts  Institute  of  Technology 


Abstract 

We  describe  a  prototype  system,  which  we  call 
DINDE,  and  the  directed  network  model  of  statistical 
analysis  on  which  it  is  currently  based.  DINDE  is  a 
highly  interactive  display  oriented  system  where  the 
user  carries  out  the  analysis  by  building  and 
maintaining  a  network  representation  of  it.  An 
example  analysis  is  used  to  describe  this  interaction 
and  the  analysis  management  tools  required. 

1.0  Introduction 

By  statistically  sophisticated  software,  we  do  not 
mean  software  that  implements  a  sophisticated 
statistical  method,  but  rather,  software  that 
contains  information  on  how  and  when  that  method 
is  most  frequently  used  in  practice. 

As shown in Figure 1, there are at least three
reasons why one might attempt implementation of
such software: first, to have software which guides
the user to a better statistical analysis (e.g. Gale and
Pregibon [1982], Oldford and Peters [1984]); second,
to use the software as a medium for studying
statistical strategy (e.g. Pregibon [1985] or Oldford
and Peters [1984, 1985ab]); and, third, to have
software which helps the user manage the analysis
(e.g. Carr et al [1984], Becker and Chambers [1985],
Oldford and Peters [1985b]).

Figure 1: Three interrelated objectives

These  objectives  are  interrelated  and  software 
for  one  often  leads  naturally  to  software  for  another. 
For  example,  to  successfully  guide  the  analysis  one 
needs  to  understand  and  implement  the  supporting 
statistical  strategies.  To  study  the  strategies  of  good 
statistical  data  analysis  with  software,  one  needs  to 
be  able  to  manage  the  analysis.  In  developing 
DINDE,  we  have  repeatedly  found  ourselves 
concentrating  on  each  objective  in  turn; 
advancement  toward  one  objective  has  often 
produced  insight  on  one  of  the  others. 

Whether  the  chief  interest  is  to  guide,  study,  or 
manage  the  analysis,  some  model  of  an  analysis  is 
required.  The  next  two  sections  address  this 
question  rather  generally.  The  remaining  sections 
describe  the  current  implementation  of  one  such 
model  in  DINDE.  Like  any  model,  the  current 
model  of  statistical  data  analysis  used  in  DINDE  is 
temporary  and  will  improve  with  experience.  The 
last  section  indicates  some  of  the  modifications  we 
already  foresee. 


2.0  What’s  an  Analysis? 

The  simplest,  and  least  satisfying,  view  of  a 
statistical  analysis  is  as  a  specified  sequence  of  steps. 
An  example  would  be  regression  modelling  by  a 
forward  selection  procedure,  where  variables  are 
added  to  the  current  regression  model  one  at  a  time 
according  to  some  criterion. 

This  model  of  statistical  analysis  seems  to 
underlie  the  batch-oriented  statistical  packages  of 
the  late  1960s  and  early  1970s,  or,  at  least,  to 
underlie  their  common  usage.  The  sequence  of  steps 
to  be  taken  in  the  analysis  is  defined  in  advance  and 
the  corresponding  set  of  packaged  routines  is  run. 
In  light  of  the  results,  this  sequence  may  be  modified 
and  the  resulting  new  set  of  procedures  run.  Each 
run  is  regarded  as  a  different  analysis.  The 
refinement  of  the  analysis  continues  until  the 
analyst  is  satisfied  that  the  final  one  is  "correct"  for 
the  problem.  In  this  paradigm,  novices  typically 
would  not  substantially  modify  their  original 
analysis. 

A  more  accurate  view  of  statistical  analysis, 
based  largely  upon  a  scientific  modelling  paradigm 
as  expressed,  for  example,  by  Box[1976],  is  that  it  is 
an  iterative  procedure  whereby  statistical  models 
are alternately fitted and criticised.

Unrolling  this  iterative  loop  produces  a  different 
kind  of  sequential  process,  one  whose  steps  are  not 
predetermined.  Instead,  the  analyst  decides  what  to 
do  next  based  upon  the  results  of  the  preceding  step. 
Rather  than  a  specified  sequence  that  is  continually 
refined,  this  model  of  an  analysis  is  a  dynamic 
sequence,  one  that  grows  as  more  is  learned  about 
the  problem  and  the  data.  The  analysis  is 
represented  by  the  entire  sequence,  not  just  by  a 
final  revision.  Here,  novices  would  typically 
produce  shorter  sequences  than  would  experts. 

As  such,  this  model  fits  well  with  modern 
interactive  statistical  systems  like  S  (Becker  and 
Chambers  [1984]),  where  each  step  of  the  analysis 
corresponds  to  a  command  issued  to  the  system. 
After  examining  the  results  of  each  command,  the 
analysis  is  grown  by  issuing  another  command.  A 
macro  facility  is  usually  available  to  allow  the  user 
to  compress  many  small  sequential  steps  in  the 
analysis  into  a  single  larger  one  that  is  easier  to 
comprehend, and use, as a unit. In this way, new
analysis  steps  are  defined  by  the  analyst.  Diaries 
reinforce  this  model  of  analysis  by  recording  the 
entire  sequence  of  steps. 

The  dynamic-sequence  model  suggests  that 
results  and  information  from  actions  taken  early  in 
the  analysis  influence  later  actions  only  through  the 
chain  of  steps  given  by  the  time  ordering.  But  this  is 
not  an  accurate  description  of  an  analysis.  At  any 
step,  a  number  of  different  actions  can  be,  and  often 
are,  taken.  A  model  of  analysis  based  only  on  the 
time  order  of  actions  hides  this  logical  relationship 
between  actions,  and,  hence,  is  seriously  incomplete. 

This  is  an  important  shift  in  focus:  from  the  time 
ordering  of  statistical  and  arithmetic  procedures  to 
the conceptual steps of the analysis and the logical
connections between the steps. Now, the simplest
model  of  the  analysis  is  a  tree,  where  branches 
indicate  a  logical  connection  between  one  step  and  a 
number  of  others.  For  example,  at  different  times, 
different  actions  or  decisions  may  be  taken  from  one 
step,  resulting  in  many  branches  from  it.  With  this 
model,  novice  analysts  would  likely  produce  short, 
sparse  trees  and  expert  analysts  long,  bushy  trees. 

A  little  reflection,  however,  shows  that  the  tree 
model  also  falls  short.  Suppose  that  two  branches  of 
the  tree  represent  two  different  sub-analyses  that 
are  pursued  in  parallel.  It  may  happen  that  a  new 
tack  is  taken  in  the  analysis  that  is  based  on  the 
combined  results  of  both  independent  sub-analyses. 
Where  should  this  new  sub-analysis  be  attached? 
The  obvious  answer  is  to  attach  it  to  both  of  the 
previous  ones,  forcing  the  whole  analysis  to  become 
a  directed  network  rather  than  a  tree. 
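
The difference between the tree model and the directed network model can be made concrete with a small sketch. This is not DINDE code (DINDE is built in Interlisp-D with LOOPS); the class AnalysisNetwork and the step names are hypothetical, chosen only to show a step with two parents.

```python
class AnalysisNetwork:
    """A directed network of analysis steps; a step may have several parents."""

    def __init__(self):
        self.parents = {}   # step name -> list of parent step names
        self.children = {}  # step name -> list of child step names

    def add_step(self, name, parents=()):
        for parent in parents:
            if parent not in self.parents:
                raise KeyError(f"unknown parent step: {parent}")
        self.parents[name] = list(parents)
        self.children[name] = []
        for parent in parents:
            self.children[parent].append(name)

    def is_tree(self):
        """True when no step has more than one parent."""
        return all(len(p) <= 1 for p in self.parents.values())

    def ancestors(self, name):
        """All steps that the given step logically depends on."""
        seen = set()
        stack = list(self.parents[name])
        while stack:
            step = stack.pop()
            if step not in seen:
                seen.add(step)
                stack.extend(self.parents[step])
        return seen


net = AnalysisNetwork()
net.add_step("data")
net.add_step("sub-analysis A", parents=["data"])
net.add_step("sub-analysis B", parents=["data"])
was_tree = net.is_tree()  # still a tree at this point
net.add_step("combined analysis", parents=["sub-analysis A", "sub-analysis B"])
```

Attaching the combined analysis to both sub-analyses is exactly what a tree cannot express: once it is added, is_tree() no longer holds, while the logical dependence on both branches is preserved.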

This  directed  network  model  of  statistical  data 
analysis  is  currently  the  basis  for  DINDE.  While  it 
provides  a  better  description  than  any  of  the 
previous  models,  it  too  has  shortcomings.  We 
discuss  some  of  these  in  the  last  section,  and  indicate 
our  planned  modifications  to  the  network  model. 

3.0  What  are  the  steps  again? 

Implicit  in  the  models  considered  above  is  the 
assumption  that  every  analysis  has  identifiable 
steps:  decision  points  where  some  action  is  taken. 
Further,  it  is  assumed  that  many  of  these  steps  are 
generic  enough  to  be  usefully  recorded.  What,  then, 
are  the  steps? 

With  current  statistical  systems,  the  steps  are 
equivalent  to  the  commands  that  are  issued  to  the 
system.  The  first  thing  to  notice  is  that  the  steps 
have  varying  granularity.  The  smallest  grains 
include  those  steps  where  the  actions  are  simple, 
arithmetic  ones  taken  on  scalars,  vectors,  matrices, 
and  the  like.  Good  statistical  systems  will  always 
allow  actions  to  be  taken  at  this  low  level  of 
analysis.  Larger  grains  include  strictly  statistical 
actions,  like  regressing  y  on  x,  where  the  lower  level 
steps  needed  to  accomplish  the  task  are  suppressed 
from  consideration.  The  regression  step  is  really  an 
abstraction  of  many  lower-level  analysis  steps,  an 
abstraction  that  becomes  a  powerful  tool  for  the 
analyst. 

More  abstract  steps  are  typically  more  powerful 
(i.e.  do  more  for  the  analyst),  but  also  have  more 
restricted  ranges  of  application  (e.g.  regression  is  a 
powerful  tool  but  has  smaller  range  of  application 
than  the  matrix  operations  used  to  construct  it).  In 
designing  useful  steps  for  statistical  analysis,  there 
is  always  a  potential  tension  between  the  range  and 
the  power  of  a  newly  proposed  step. 

In  DINDE,  our  working  philosophy  has  been  to 
begin  with  steps  that  are  reasonably  generic  and  to 
increase  their  power  in  two  ways  that  do  not  restrict 
their  range  of  application. 

First,  we  specialize  some  steps  to  become  more 
context  specific.  The  specialized  steps  are  to  be  used 
in  place  of  more  general  steps  when  the  context 
warrants  it.  Therefore,  the  necessarily  smaller 
range  of  applicability  of  the  specialized  steps  does 
not  inhibit  the  analysis.  Instead,  given  the  right 
context,  they  become  powerful  tools. 


Second,  in  each  step,  information  is  incorporated 
as  to  which  steps  are  often  taken  next.  This  simple 
addition  makes  the  step  more  powerful  at  no  loss  to 
its  range  of  application. 

In  DINDE,  commands  sometimes  produce  steps, 
but,  steps  are  never  commands.  The  steps  in  DINDE 
are  quite  different;  they  are  collection  points  for 
possible  actions  (commands).  Instead  of  specifying  a 
sequence  of  actions  to  be  taken,  the  abstraction  in 
DINDE  is  to  collect  together  a  set  of  possible  actions 
and  the  information  that  may  help  the  analyst 
decide  among  them.  A  sequence  of  actions  identified 
and  captured  in  DINDE  would  simply  be  a  more 
abstract  action,  not  a  step. 

At  present,  the  steps  that  have  been  developed  in 
DINDE  are  either  analysis  goals  or  analysis 
artifacts. 

For  example,  an  analysis  goal  might  be  a 
reasonable  description  of  the  regression  of  y  on  x  (i.e. 
the  conditional  expectation  of  y  given  x).  This  goal 
is  represented  in  DINDE  by  the  step 
BivariateRegression  (bivariate  since  two  variables 
are  involved).  A  variety  of  information  is 
immediately  available  at  this  step:  which  vectors  x 
and  y  refer  to,  what  set  of  actions  are  generally 
reasonable  to  take  next  (various  plots  and  fitting 
routines),  and  which  of  these  actions  it  is  considered 
wise  to  take  first  (visually  inspect  the  data  via  a 
scatterplot  and  histograms). 

Choosing  to  do  the  scatterplot  of  y  versus  x  would 
produce  a  Scatterplot  as  an  artifact.  Again,  since  it 
is  a  new  step,  particular  actions  would  be  made 
available  for  this  kind  of  artifact  (like  fitting  a 
straight  line  or  a  smooth  curve  to  the  points  in  the 
plot). 
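
The idea of a step as a collection point for possible actions, each of which may produce a new step, might be sketched as below. This is only an illustration in Python, not the LOOPS implementation; the base class Step, the method take, the class FittedLine, and the action names are all assumptions made for the example.

```python
class Step:
    """A collection point: possible actions plus those suggested first."""

    actions = {}          # action name -> class of the step it produces
    suggested_first = ()  # actions considered wise to take first

    def __init__(self, **info):
        self.info = info  # e.g. which vectors x and y refer to

    def take(self, action):
        """Carry out an offered action and return the resulting new step."""
        if action not in self.actions:
            raise ValueError(f"{action!r} is not offered at {type(self).__name__}")
        return self.actions[action](**self.info)


class FittedLine(Step):  # hypothetical artifact
    pass


class Scatterplot(Step):  # artifact with its own actions
    actions = {"fit straight line": FittedLine}


class BivariateRegression(Step):  # analysis goal
    actions = {"scatterplot": Scatterplot}
    suggested_first = ("scatterplot",)


goal = BivariateRegression(x="x", y="y")
plot = goal.take("scatterplot")
```

Note that the step itself is not a command: it only gathers the actions and the advice, and taking an action yields another step with a different set of actions.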

Clearly,  a  number  of  steps  are  necessary  for  an 
analysis  with  even  the  relatively  simple  goal  of 
describing  the  regression  of  one  variable  on  another. 
The  challenge,  then,  is  to  make  the  steps  generic 
enough  to  be  useful  in  a  variety  of  analyses. 

4.0  DINDE 

The  challenge  of  DINDE  is  to  produce  and 
implement  a  model  for  statistical  analysis  that  is 
both  reasonably  accurate  and  natural  to  use. 
Sections  2  and  3  indicate  the  underlying  model; 
here,  and  in  the  sections  which  follow,  we  discuss  its 
implementation  and  use. 

DINDE  is  an  enrichment  of  an  extensive 
interactive  programming  environment:  Interlisp-D 
with  LOOPS  which  runs  on  the  Xerox  1108  personal 
workstation  (see  Teitelman  and  Masinter  [1981], 
Stefik  et  al  [1983]).  The  combination  of  high 
interaction,  extensive  graphics,  powerful  dedicated 
computing,  and  powerful  programming  tools 
available in this environment has provided
enormous leverage for designing and building
DINDE. 

The  hardware  is  described  in  more  detail  in 
Oldford  and  Peters  [1985b].  We  note  only  two 
features  here:  first,  an  illusion  of  an  infinitely  large 
memory  is  maintained  for  the  user  (1.5  -  3.5  MBytes 
of  real  memory,  32  MBytes  virtual),  and  second,  the 
display  is  a  high  resolution  bit-mapped  display 
(about  1000  by  1000  individually  addressable  pixels) 
that  can  be  interacted  with  using  a  "mouse" 
pointing  device. 


The  software  environment  is  at  least  as 
important  as  the  hardware.  Interactive 
programming  environments  are  the  most 
appropriate and productive locales for doing the sort
of  experimental  programming  that  is  involved  in 
building  a  system  like  DINDE  (and,  we  would 
argue,  in  carrying  out  statistical  data  analysis). 
This  point  is  forcefully  argued  by  Sheil  [1983].  We 
have  found  the  object-oriented  programming 
paradigm,  as  available  in  LOOPS,  to  be  especially 
useful:  it  is  the  backbone  of  DINDE  (see  Goldberg 
and  Robson  [1983],  Bobrow  and  Stefik  [1983],  or 
Stefik  and  Bobrow  [1985]  for  details  on  this 
programming  paradigm). 

In  the  object-oriented  paradigm,  there  are 
classes,  which  contain  generic  properties  and 
behaviours  for  a  large  "class"  of  individual  objects, 
and  objects,  which  are  individual  "instances"  of  a 
particular  class.  For  example,  the  class  Car  would 
represent  the  common  properties  and  behaviours  of 
all  Cars,  while  the  object  MyCar  would  be  a 
particular  instance  of  the  generic  class  Car,  as 
would  EdsCar,  KarensCar,  and  so  on.  In  DINDE, 
each  analysis  step  (goal  or  artifact)  is  represented  as 
a  class;  the  steps  actually  taken  in  a  particular 
analysis  are  objects,  instances  of  their  corresponding 
classes. 
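
The class and instance distinction can be shown in a few lines. The sketch uses Python rather than LOOPS, and the attribute names (wheels, owner, label) are invented for the example; only the names Car and Scatterplot come from the text.

```python
class Car:
    """Generic properties and behaviours shared by all cars."""

    wheels = 4  # property common to every instance

    def __init__(self, owner):
        self.owner = owner  # property of one particular car

    def describe(self):
        return f"{self.owner}'s car with {self.wheels} wheels"


eds_car = Car("Ed")
karens_car = Car("Karen")


class Scatterplot:
    """A kind of analysis step (the class)."""

    def __init__(self, label):
        self.label = label


# The steps actually taken in one analysis are instances of the class.
steps_taken = [Scatterplot("y versus x"), Scatterplot("residuals versus x")]
```

Both entries in steps_taken share the behaviour defined once in Scatterplot, yet each records a distinct step of a particular analysis.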

The  idea  in  DINDE,  then,  is  to  select  the  kind  of 
step  (class)  which  one  wants  to  take  next  and  to 
incorporate  an  instance  of  it  (representing  the  step 
actually  taken)  at  an  appropriate  place  in  the 
analysis  (directed  network). 

The  set  of  possible  steps  (i.e.  classes)  that  can  be 
taken in an analysis is displayed in an interactive
window  called  the  toolbox.  Using  the  mouse,  the 
analyst  selects  a  step  from  the  toolbox,  as  necessary, 
and  attaches  it  (now  an  object)  to  the  appropriate 
place  in  the  analysis. 

Recall  that  our  model  of  an  analysis  is  a  directed 
network  whose  nodes  are  the  steps  actually  taken. 
The  analyst  works  with  this  model  within  another 
window  called  an  analysis  map.  Here,  the  network 
representing  the  analysis  is  actually  displayed  and 
the  analysis  progresses  by  interacting  with  the 
network  via  the  mouse. 

This  mouse  interaction,  called  "mousing"  in 
what  follows,  is  possible  in  both  the  toolbox  and  the 
analysis  map.  In  both  windows  the  mouse 
behaviours follow some general principles. Figure 2
shows a generic window in DINDE and the basic
mousing that can be done with it. There are two
mouse sensitive areas in these windows: the title bar
and the body. Mousing in the title bar allows
interaction with the displayed contents as a whole;
mousing on individual objects displayed in the body
allows interaction with the mouse-selected object.

Figure 2: Mouse sensitive areas of DINDE windows

The  mouse  that  is  used  on  the  Xerox  1108  has 
two  buttons,  yielding  three  different  combinations  of 
depressed  buttons:  left,  right,  and  middle  (both 
buttons)  depressed.  Depressing  any  one  of  these 
typically  causes  a  menu  to  pop  up  (at  the  mouse 
position)  from  which  one  of  a  number  of  items  can  be 
selected.  Selecting  an  item  causes  some  action  to  be 
taken. 

We  try  to  follow  the  principle  that  a  button 
combination  should  yield  a  similar  kind  of  menu 
regardless  of  what  is  being  moused.  So,  in  the 
toolbox  and  the  analysis  map,  left-buttoning  always 
produces  menus  whose  items  have  something  to  do 
with  either  accessing  or  storing  information  on  the 
thing  being  moused.  If  the  title  bar  is  moused,  then 
the  information  pertains  to  the  toolbox  or  analysis 
map  itself.  If  an  individual  object  in  the  body  is 
moused,  then  the  information  pertains  only  to  that 
particular  object.  Right-buttoning  always  brings  up 
a  menu  containing  items  that  allow  the  user  to 
manipulate  the  window  (move  it,  reshape  it,  shrink 
it,  etc.).  Middle-buttoning  causes  menus  to  appear 
whose  items  indicate  the  messages  that  the  thing 
selected  can  respond  to.  Typically,  these  will  be 
action items such as "fit a smooth curve to your
points" if the thing selected is a Scatterplot object.
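
The button convention just described amounts to a small dispatch rule: the button decides the kind of menu, and the thing moused decides its scope. The sketch below restates that rule; the function menu_for and its return strings are hypothetical and do not reproduce any DINDE menu.

```python
MENU_KIND = {
    "left": "information",  # access or store information on the thing moused
    "right": "window",      # move, reshape, or shrink the window
    "middle": "messages",   # messages the selected thing can respond to
}


def menu_for(button, target):
    """Describe the menu for a button press on 'title bar' or a named object."""
    if button not in MENU_KIND:
        raise ValueError(f"unknown button combination: {button!r}")
    kind = MENU_KIND[button]
    if kind == "window":
        # Right-buttoning manipulates the window regardless of the target.
        return "window operations"
    scope = "the whole window" if target == "title bar" else target
    return f"{kind} for {scope}"
```

For example, menu_for("middle", "Scatterplot") describes the action menu of a Scatterplot object, while menu_for("left", "title bar") describes information about the toolbox or analysis map as a whole.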

The  toolbox  and  the  analysis  map  are  discussed 
in  turn  in  the  next  two  sections. 

5.0  The  Toolbox 

All  of  the  classes  of  objects  that  can  be  used  in  a 
statistical  analysis  in  DINDE  are  arranged  in  the 
toolbox  and  displayed  to  the  user.  Figure  3  shows 
the  contents  of  the  toolbox  as  it  currently  exists  in 
DINDE. 

We  have  tentatively  established  a  coarse 
partition  of  the  possible  classes  into  five  basic 
element  types:  (i)  Data  (currently  represented  as 
Arrays or TreeStructures), (ii) Graphics, (iii)
Situations,  (iv)  Models  (e.g.  probability  models),  and 
(v)  Tables.  To  date,  only  the  first  three  of  these  exist 
in  DINDE.  Only  these  were  necessary  to  build  the 
prototype  regression  analysis,  but  we  anticipate 
that  eventually  Model  and  Table  representations 
will  also  be  required. 

In  the  toolbox  of  Figure  3,  the  classes  are 
displayed  as  nodes  on  several  trees.  Traversing  the 
trees  from  left  to  right  is  equivalent  to  moving  from 
generic  steps  to  more  specialized  ones.  For  example, 
BooleanArrays,  StringArrays,  and  FloatArrays,  are 
all  specialized  Arrays  (specialized  to  have  array 
elements  whose  values  must  be  booleans,  strings, 
and  floating  point  numbers,  respectively).  Once 
more  tools  are  available  in  DINDE,  other 
arrangements  may  be  of  interest  (different  indexing 
depending  on  interest).  Indeed,  it  will  likely  become 
desirable  to  group  tools  together  into  smaller 
toolboxes,  or  toolkits,  as  the  number  of  tools  grows. 
At present, however, the classes are displayed only
according to their specialization.

Figure 3: The DINDE Toolbox

In  general,  a  specialization  has  access  to  all  the 
information  and  actions  available  to  a  more  general 
step,  and  more.  Thus,  what  seems  to  be  a 
counterintuitive  relationship  between  a  FloatMatrix 
and  a  FloatVector  makes  sense,  because  a 
FloatMatrix  has  access  to  actions,  such  as  taking  the 
singular  value  decomposition  of  itself,  which  would 
make  little  sense  for  a  FloatVector. 

Those  specializations  which  have  been 
undertaken  to  increase  the  power  of  steps  are  more 
intuitive.  For  example,  ResidualScatter  is  a 
specialization  of  Scatterplot  which  always  has 
residuals plotted along the vertical axis. Given this
context, ResidualScatter has access to actions which
would not make sense for an arbitrary Scatterplot,
like the ability to smooth the positive and negative
values  of  the  vertical  coordinates  (although 
smoothing  all  the  vertical  coordinates  will  be 
available  to  both  kinds  of  plots).  Similarly,  the 
ResidualVsFit  plot  has  access  to  actions  which  are 
helpful  in  determining  whether  the  residual  error  is 
heteroscedastic. 

Notice  that  these  relations  between  the  classes 
do  not  really  conform  to  trees.  LOOPS  allows  one  to 
create  classes  that  are  specializations  of  more  than 
one class (e.g. ResidualVsIndex). Such classes are
identified by having a box drawn around them
whenever they are repeated in the display. They
have access to all information and actions that are
available to any of their "parent" classes.
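
The way a class with several parents gathers the actions of all of them can be sketched outside LOOPS. The following is a minimal Python illustration, not DINDE code: the mix-in ResidualPlot and the action names are hypothetical, and the text does not say which parents the real DINDE classes have.

```python
class Scatterplot:
    """Generic plot step: holds the x and y coordinates."""

    def __init__(self, x, y):
        self.x = list(x)
        self.y = list(y)

    def actions(self):
        # Smoothing all vertical coordinates is available to every scatterplot.
        return ["smooth_all"]


class ResidualPlot:
    """Hypothetical mix-in for plots whose vertical axis holds residuals."""

    def actions(self):
        # Cooperative call: add to whatever the other parent classes offer.
        return super().actions() + ["smooth_positive_and_negative"]


class ResidualScatter(ResidualPlot, Scatterplot):
    """A specialization with two parents; it inherits from both."""


plot = ResidualScatter([1, 2, 3], [0.4, -0.1, -0.3])
print(plot.actions())
```

The specialized class offers everything the generic Scatterplot offers, plus the actions that only make sense once the vertical axis is known to hold residuals.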

This  ability  to  mix  together  classes  encourages 
the  abstraction  of  common  aspects  of  different  kinds 
of  analysis  steps.  The  abstractions  are  then 
represented  as  generic  classes  that  can  be  usefully 
"mixed  into"  more  than  one  step.  This  is  perhaps 
one  of  the  most  challenging  aspects  of  creating  a 
sophisticated  system  like  DINDE.  It  requires  an 
identification  and  grouping  of  the  elements  that  are 
practically  important  in  a  statistical  analysis. 

The  nature  of  this  research  can  be  seen  by 
considering  those  analysis  steps  that  are  classified 
as Situations. There are five different classes, of
which only one can really be regarded as a goal
(BivariateRegression); the others are better
described as artifacts.

The  analysis  step  BivariateRegression  represents 
the goal of regressing y on x. It contains the
necessary information on which vector is y and
which is x, and has access to various plotting methods
(Scatterplot and Histograms of y and x) and various
fitting procedures (a straight line via least-squares
or an outlier-resistant procedure, and a running
linear least-squares smooth). Selecting any of these
actions  will  produce  an  artifact  (a  plot  or  a  fit),  and 
hence  a  new  analysis  step.  There  are  four  kinds  of 
fit  artifacts.  In  order  of  increasing  specialization 
they are as follows: BivariateFit, which is used to
represent arbitrary fits (such as those from
smoothers), contains information like y, x, the
residuals, and the fitted values, and has its own set
of actions, including the production of various
residual plots; BivariateLinearFit which, in addition
to the information and actions available to it from
BivariateFit, also contains a slope and intercept for
the fitted line; BivariateLeastSquares and
BivariateResistantFit, each of which has access to
more specialized information that is relevant to its
particular fitting procedures (e.g. $R^2$ or t-statistics
for BivariateLeastSquares). Other factorizations
are  certainly  possible. 
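
One such factorization of the fit artifacts can be sketched as a chain of increasingly specialized classes. This is a minimal Python illustration that borrows the class names from the text; the attribute names and constructors are assumptions, and the resistant fit is omitted.

```python
class BivariateFit:
    """Arbitrary fit: knows x, y, the fitted values and the residuals."""

    def __init__(self, x, y, fitted):
        self.x, self.y, self.fitted = list(x), list(y), list(fitted)
        self.residuals = [yi - fi for yi, fi in zip(self.y, self.fitted)]


class BivariateLinearFit(BivariateFit):
    """A fit that is a straight line, so it also has a slope and intercept."""

    def __init__(self, x, y, slope, intercept):
        x = list(x)
        self.slope, self.intercept = slope, intercept
        super().__init__(x, y, [intercept + slope * xi for xi in x])


class BivariateLeastSquares(BivariateLinearFit):
    """A straight line chosen by least squares; adds R squared."""

    def __init__(self, x, y):
        x, y = list(x), list(y)
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        super().__init__(x, y, slope, mean_y - slope * mean_x)
        syy = sum((yi - mean_y) ** 2 for yi in y)
        rss = sum(r * r for r in self.residuals)
        self.r_squared = 1.0 - rss / syy


fit = BivariateLeastSquares([0, 1, 2, 3], [1, 3, 5, 7])
print(fit.slope, fit.intercept, fit.r_squared)
```

Each level adds only what is new to it: residuals come from BivariateFit, the line from BivariateLinearFit, and the least-squares summary from the most specialized class.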

In  the  toolbox,  information  is  available  on  any  of 
these  steps.  Selecting  the  class  in  question  with  the 
left  mouse  button  depressed  produces  the  menu 
shown  below  in  Figure  4. 
Figure 4: Left-button Menu for a Class (legible menu items include LongSummary and InternalDescription)


By  selecting  the  appropriate  item  with  the 
mouse,  the  user  can  get  a  short,  or  long,  summary  on 
that  class,  find  out  what  variables  it  requires,  get 
relevant  references  to  the  statistical  literature,  find 
out  the  classes  from  which  it  gains  access  to  its 
information and actions, and get a description of its
internal software structure. (Note that a right
arrow  on  a  menu  item  indicates  that  a  more  detailed 
menu  can  be  had  by  sliding  the  mouse  to  the  right, 
along  that  item.) 

Left-buttoning  in  the  title  bar  of  the  toolbox 
produces  a  menu  offering  the  user  information  on 
the  toolbox  itself,  its  contents,  and  the  associated 
mouse  behaviours.  Other  mouse  buttons 
manipulate  the  display. 

Rarely,  in  the  course  of  an  analysis,  should  the 
user  need  to  retrieve  tools  directly  from  the  toolbox. 
If  the  classes  are  defined  well,  then  the  tools  made 
available  through  the  actions  at  any  given  step 
should  suffice. 

6.0  Analysis  Maps 

Figure  5  shows  an  analysis  of  the  relationship 
between  the  average  brain  and  body  weights  of  62 
mammals  (taken  from  Becker  and  Chambers 
[1984]). Each node in the network represents a step
in  the  analysis.  The  label  displayed  at  each  node  is 
supplied  by  the  object  it  represents,  and  consists  of 
the  name  of  the  object  (if  there  is  one),  its  class,  and, 
occasionally,  capsule  information  on  its  contents 
(e.g.  BodyWts  is  the  unique  name  of  a  particular 
FloatVector  object  having  62  elements). 

Mousing  on  a  node  permits  interaction  with  the 
object  it  represents.  For  example,  selecting  a  node 
with  the  left  button  down  causes  the  menu  of  Figure 
6  to  appear.  This  menu  allows  information  to  be 
either  added  to,  or,  retrieved  from,  the  selected 
object. 
Figure 5: An Analysis Map

Figure 6: Left-button Menu for an Object

Information  is  added  either  by  giving  the  object  a 
meaningful  and  unique  name,  or,  by  adding  notes  to 
the  object  (the  word  processing  capability  of 
Interlisp-D  used  to  construct  this  paper  is  made 
available).  Both  of  these  can  be  used  to  make  the 
analysis  easier  to  understand:  the  name  at  the  node 
can  make  the  display  easier  to  follow,  and  the  notes 
can  record  the  analyst's  observations  on  some  facet 
of  the  analysis. 

We  consider  the  ability  to  make  notes  to  be 
important  enough  that  we  include  Memo  objects  as 
possible  steps  in  the  analysis.  In  Figure  5,  a  Memo, 
called  WhyLogs?,  was  inserted  between  the  two 
original  data  vectors,  BodyWts  and  BrainWts,  and 
the two derived FloatVectors, LnBodyWts and
LnBrainWts.  The  latter  two  are  the  natural 
logarithms  of  the  raw  data,  so  the  Memo  is  used  to 
record  the  reasons  for  making  the  transformation. 

Information  is  accessed  in  a  variety  of  ways:  by 
reading  the  user-recorded  notes  (ReadNotes),  by 
printing  a  short  summary  on  the  class  of  the  object 
(ShortSummary),  by  inspecting  the  internal  program 
structure  of  the  selected  object  (Inspect),  and,  by 
examining  the  detail  contained  in  that  node  (Zoom). 

The  last  of  these  is  uniformly  used  in  DINDE  to 
access  further  detail  on  any  node  in  the  analysis. 
"Zooming"  on  a  Graphic  (e.g.  Scatterplot  or 
QQGauss)  will  cause  a  window  containing  the  plot 
to  appear.  Zooming  on  other  objects  produces  a 
window  containing  the  "Zoomed"  object  and  the  data 
it  can  access.  Figure  7  shows  the  effect  of  Zoom  on 
the BivariateLeastSquares node of the map in Figure
5. (As in other DINDE windows, the mouse provides a
convenient way of interacting with the displayed
objects.)

The most important use of the Zoom facility is to
retrieve the details of some sub-analysis.
Sub-analyses  can  be  represented  in  DINDE  as 
objects  called  SubMaps,  an  example  being  the  one 
named  DeadEndRegression  in  Figure  5.  Zooming  in 
on  this  SubMap  produces  the  analysis  map  of  Figure 
8. Except for its contents, this map is identical in
every respect to that of Figure 5; both are instances
of the same class.

Figure 8: Zoom on a SubMap (node labels shown: DeadEndRegression; BivariateRegression; Histogram, Histogram, ScatterPlot)

The  analyst  can  focus  attention  on  this  map, 
continuing  the  analysis  there,  without  affecting  the 
contents  of  any  other  map.  Indeed,  this  map  might 
also  contain  SubMaps,  each  representing  a  yet  finer 
sub-analysis.  These,  in  turn,  may  contain  others, 
and  so  on,  so  that  the  whole  analysis,  in  DINDE,  is 
actually  a  directed  network  whose  nodes  might  also 
be  networks,  each  one  representing  a  new  level  of 
detail. 

The  inverse  of  Zooming,  in  DINDE,  is 
Compression.  Middle-buttoning  on  the  title  bar  of 
any  analysis  map  produces  a  menu  whose  items 
correspond  to  operations  on  the  displayed  network. 
One  of  these  items  is  Compress.  Once  selected,  the 
user  is  required  to  identify  nodes  in  the  analysis 
(usually  by  mouse-selecting  them)  to  be  compressed 
into  a  single  SubMap.  All  of  the  relationships 
between  these  nodes  are  maintained,  so  that 
Zooming  on  the  new  SubMap  will  reproduce  the 
necessary  detail. 
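
The relation between Compress and Zoom can be made concrete with a small data-structure sketch. The Python below is an illustration under stated assumptions, not the DINDE implementation: the classes Node, AnalysisMap and SubMap and their methods are hypothetical, and links are stored simply as (from, to) pairs.

```python
class Node:
    """One analysis step, identified here only by its label."""

    def __init__(self, label):
        self.label = label


class AnalysisMap:
    """A directed network of analysis steps."""

    def __init__(self):
        self.nodes = []
        self.links = set()  # pairs (from_node, to_node)

    def add(self, node, parent=None):
        self.nodes.append(node)
        if parent is not None:
            self.links.add((parent, node))
        return node

    def compress(self, selected, label):
        """Replace the selected nodes by a single SubMap that contains them."""
        chosen = set(selected)
        sub = SubMap(label)
        sub.inner.nodes = [n for n in self.nodes if n in chosen]
        outer_links = set()
        for a, b in self.links:
            if a in chosen and b in chosen:
                # Relationships among the selected nodes move inside.
                sub.inner.links.add((a, b))
            else:
                # Links crossing the boundary are re-attached to the SubMap.
                outer_links.add((sub if a in chosen else a,
                                 sub if b in chosen else b))
        self.nodes = [n for n in self.nodes if n not in chosen] + [sub]
        self.links = outer_links
        return sub


class SubMap(Node):
    """A node that is itself an analysis map."""

    def __init__(self, label):
        super().__init__(label)
        self.inner = AnalysisMap()

    def zoom(self):
        return self.inner
```

Because a SubMap holds an ordinary AnalysisMap, the map returned by zoom can itself be extended or compressed again, which gives the nested levels of detail described above.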
Figure 7: Zoom on BivariateLeastSquares


The  other  middle-button  menu  items  from  the 
title  bar  include  the  following:  AddAnalysisNode 
which  allows  the  user  to  select  a  new  step  from  the 
toolbox  and  attach  it  anywhere  in  the  network, 
MakeLink and BreakLink which allow the user to
make  and  break  links  in  the  network  in  order  to 
make  the  analysis  easier  to  understand,  and, 
InterposeMemo,  which  allows  the  user  to  insert  a 
memo  between  two  nodes  in  the  network.  Together 
with  Compress,  these  network  tools  should  enable 
the  analyst  to  construct  an  analysis  of  arbitrary 
complexity,  whose  display  can  be  understood 
without  much  difficulty. 

It  is  not  necessary  to  select  each  new  step  in  the 
analysis  from  the  toolbox.  Middle-buttoning  on  a 
step,  already  in  the  analysis  map,  will  produce  a 
series  of  menus  that  contain  the  actions  that  step 
can  take.  The  consequence  of  many  of  these  actions 
is  a  new  step  that  is  attached  to  the  selected  one. 


Figure 9 shows some of the menus of actions that
are accessible from a BivariateLeastSquares step (as
in Figure 5). Here, the mouse has been moved to the
right over the item LocalMethods, in the first menu,
and over the item PlotResiduals, of the second menu,
to display a menu of the possible residual plots that
can be performed. The three steps QQGauss,
ResidualVsFit, and ResidualScatter, of Figure 5,
were produced by selecting from this menu the items
QQPlot, ResidualsVersusFit, and ResidualsVersusX,
respectively.

Figure 9: Some Menus from BivariateLeastSquares

All steps in the analysis have access to menus
that make sense for that step, and, typically, the
analysis is constructed by selecting items from these
menus. The step BivariateLeastSquares was
produced by selecting the item
AddALeastSquaresLine from a similar set of menus
on the Scatterplot in Figure 5.

The system of menus available at each step
makes the analysis easier to carry out; it does not
restrict it. At any time, the analyst has access to a
wide range of possibilities. For example, the
analysis can proceed from any step in the network,
including SubMaps and their contents, not just from
the most recent one. Alternatively, a new step can
be selected from the toolbox and added anywhere in
the network. Finally, the menus can be ignored
entirely and the analysis performed in the
Interlisp/LOOPS environment using DINDE objects
as they seem useful. DINDE never restricts access
to the underlying programming environment, which
means that the powerful tools we have available to
us, as system builders, are also available to the user,
as an analyst.


7.0 Concluding Remarks

We began by suggesting that there are three
interrelated objectives that one might have for
statistically sophisticated software: Guidance,
Management, and Strategy. We close by pointing
out how DINDE pays some attention to each of
these.

The guidance in DINDE is minimal and quite
local in nature. This is in keeping with our view as
to what sort of guidance it is possible to competently
give in a statistical analysis (see Oldford and Peters
[1985a] for further discussion).

The guidance consists of the identification of the
useful steps in an analysis, made available in the
toolbox, and, of the actions and suggestions made
available via menus at each analysis step. At
present, the guidance is data-independent, in the
sense that it does not depend on the data in hand.
This does not rule out the possibility of
data-dependent guidance, provided it too is of a
quite local nature (e.g. noticing collinearity in
regression).

Management  of  the  data  analysis  is  made  easier 
in  DINDE  by  having  the  analyst  work  with 
identifiable  steps  within  a  network  metaphor  for  a 
statistical  analysis. 

With  identifiable  analysis  steps,  menus  can  be 
used  to  make  those  actions,  which  are  often  taken 
next,  immediately  available  to  the  analyst. 
Further,  notes  can  be  added  at  each  step,  or  inserted 
between  steps,  which  will  help  the  analyst  recall  the 
logic  of  the  analysis.  The  computing  environment 
also  allows  the  notes,  plots,  and  even  snapshots  of 
parts  of  the  analysis,  or  individual  analysis  steps,  to 
be  inserted  directly  into  a  report. 

The  network  paradigm  is  used  to  organize  the 
steps.  Much  of  this  organization  is  done 
automatically  at  each  step.  When  it  is  not,  tools  are 
available  to  organize  the  steps  as  the  analyst  sees 
fit.  These  tools  include  the  ability  to  Zoom  in  on 
analysis  steps,  to  yield  detail,  and,  to  Compress 
many  analysis  steps  into  a  single  SubMap,  to 
suppress  detail.  Arbitrarily  many  levels  of  detail 
are  thus  available  to  the  analyst.  Finally,  by 
actively  using  the  network  paradigm  during  the 
analysis,  as  opposed  to  after  the  analysis,  the 
analyst  is  encouraged  to  organize  the  analysis  as  it 
proceeds. 

As  regards  strategy,  DINDE  is  based  on  our 
present  view  of  what  constitutes  statistical  strategy, 
and,  on  how  that  strategy  can  be  fruitfully  studied 
with  software.  We  begin  with  a  simple,  yet  general, 
model  of  statistical  analysis:  the  directed  network. 


How  a  statistical  strategy  is  implemented  within 
this  framework  depends  on  the  level,  in  the  overall 
analysis,  at  which  it  is  expected  to  operate.  For 
example,  one  low-level  strategy  might  be  a  heuristic 
used  to  determine  the  outlying  points  in  a  plot, 
while  a  high-level  strategy  might  address  the 
organization  of  a  multiple  regression  analysis  (see 
Oldford  and  Peters  [1985ab]  for  further  discussion). 

High-level  statistical  strategy  involves 
specifying  the  basic  elements  of  the  network  model: 
the  nodes,  or  analysis  steps,  and,  the  links  which 
join  them.  In  DINDE,  the  steps  are  represented  as 
objects,  and  each  object  contains  a  specified  set  of 
actions  that  can  be  taken  at  that  step.  Thus,  the 
analyst  is  strategically  encouraged  to  link  the 
present  step  to  steps  which  result  from  taking  the 
offered  actions.  Even  stronger  encouragement  is 
provided  through  Suggestions,  an  action  which 
produces  canned  text  on  what  it  is  often  considered 
wise  to  do  first  at  this  step.  A  low-level  strategy  is 
more  likely  to  be  implemented  as  an  available 
action  at  some  step. 

Of  the  three  interrelated  objectives,  that  of  using 
software  to  study  the  strategies  of  practical 
statistical  analysis  has  been  our  primary  focus.  It  is 
hoped  that  software,  like  DINDE,  might  provide  a 
useful  tableau  on  which  statistical  strategies  can  be 
recorded,  and  hence  studied.  To  this  end,  we  see 
that  the  model  used  in  DINDE  can  be  improved  in 
two  ways. 

First,  the  fundamental  data  types  in  statistical 
practice  are  not  vectors  or  matrices;  these  are 
artifacts  of  the  mathematical  analysis.  More 
statistically  meaningful  are  records  on  individuals, 
batches  of  numbers  associated  with  a  variate,  and 
the  dataset  formed  by  combining  many  records  or 
many  batches.  We  are  currently  working  on 
implementing  such  statistical  objects  as  the 
fundamental  Data  objects  in  DINDE.  (This  in  no 
way  precludes  the  use  of  matrices  and  vectors  to 
carry  out  the  arithmetic.) 

Second,  the  directed  network  model  of  analysis  is 
not  quite  rich  enough  to  suit  our  purposes.  The 
difficulty  is  that  not  all  links  have  the  same 
meaning.  For  example,  consider  again  the  analysis 
map  of  Figure  5,  and  compare  the  links  between  the 
FloatVectors  and  the  Memo,  with  those  between  the 
BivariateLeastSquares  and  the  residual  plots. 
Should  the  first  set  of  links  be  considered  to  be  as 
strong  as  the  second?  The  first  set  were  made  by  the 
user  inserting  a  Memo,  while  the  second  set  were  a 
direct  consequence  of  actions  taken  at  the 
BivariateLeastSquares  step.  Suppose  one  would  like 
to  repeat  a  path  in  the  analysis,  with  different 
values  for  the  data  (a  sensitivity  analysis  perhaps), 
then  it  becomes  important  to  distinguish  between 
those  links  which  are  necessary  to  the  analysis  and 
those  which  are  merely  convenient.  With  the  simple 
network,  where  all  links  are  the  same,  this  is  not 
possible. 
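
To make the distinction concrete, here is a minimal Python sketch of a network whose links carry a kind. It is a hypothetical illustration of the improvement discussed above, not a feature of DINDE: the class TypedNetwork, the two kind names, and the choice to link the data vector directly to its transform are all assumptions.

```python
from collections import defaultdict


class TypedNetwork:
    """Directed network in which every link is 'necessary' or 'convenient'."""

    KINDS = ("necessary", "convenient")

    def __init__(self):
        self.out = defaultdict(list)  # node -> list of (child, kind)

    def link(self, parent, child, kind):
        if kind not in self.KINDS:
            raise ValueError("unknown link kind: %r" % (kind,))
        self.out[parent].append((child, kind))

    def replay_order(self, start):
        """Steps reachable from start through necessary links only."""
        seen, order, stack = set(), [], [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            order.append(node)
            # Reverse so that children are visited in insertion order.
            for child, kind in reversed(self.out[node]):
                if kind == "necessary":
                    stack.append(child)
        return order


net = TypedNetwork()
net.link("BodyWts", "LnBodyWts", "necessary")
net.link("BodyWts", "WhyLogs?", "convenient")
net.link("LnBodyWts", "Scatterplot", "necessary")
net.link("Scatterplot", "BivariateLeastSquares", "necessary")
print(net.replay_order("BodyWts"))
```

Repeating a path with different data would then follow only the necessary links, skipping the Memo, which records reasoning but computes nothing.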

In  closing,  we  feel  that  the  three  objectives  in 
Figure  1  can  be  fruitfully  pursued  together.  We  also 
feel  that  real  headway  can  be  made  on  any  of  these 
by  basing  the  software  on  a  model  of  statistical 
analysis that is reasonably accurate, and, natural to
use. DINDE is one such attempt that is currently
based  on  a  simple  network  paradigm  for  statistical 
analysis.  As  we  improve  the  underlying  model, 
DINDE  will  come  closer  to  meeting  these  objectives. 
Abstract

The  Data  Viewer  is  a  system  for  the  exploratory  analysis  of  large,  high-dimensional 
datasets, being developed on a Lisp Machine. Given a multivariate dataset
consisting of up to 1000 observations on an (arbitrarily large) number of quantitative
variables, how can we examine it? The data viewer tackles this problem using Grand
Tour  techniques:  by  moving  projection  planes  it  displays  a  scatterplot  "movie".  Design 
issues  are  crucial  in  the  development  of  this  system,  in  particular  with  regard  to 
questions  of  user  interface.  The  Lisp  Machine  supports  object-oriented  programming 
and  the  use  of  constraints,  and  these  features  are  influential  in  our  implementation. 


Moving  Scatterplots 

Any  two  dimensional  orthogonal  projection  of  the  data  can 
be  displayed  as  a  scatterplot.  A  moving  scatterplot  results 
when  the  projection  plane  is  modified  every  fraction  of  a 
second.  If  this  is  done  as  often  as  10  times  per  second  then 
the  scatterplot  appears  to  be  in  continual  motion.  Clearly, 
this  means  that  a  user  may  see  more  data  views  in  a  shorter 
time  period;  whether  this  is  helpful  or  not  depends  on  how 
the  updating  is  done.  If  successive  plots  differ  substantially 
then  the  advantage  of  dynamic  plotting  is  lost:  a  user  has 
difficulty  assimilating  even  one  new  plot  a  second.  However, 
if  the  scatterplot  appears  to  move  in  a  continuous  manner 
then  we  gain  by  seeing  more  data,  and  by  seeing  it  move.  A 
static  plot  is  confined  to  displaying  two  dimensions,  whereas 
motion  presents  an  additional  two  dimensions  of  information, 
as  given  by  the  speed  vectors.  This  allows  a  perception  of 
depth  and  shape  of  the  pointcloud. 

Creating  Moving  Scatterplots 

A moving scatterplot is constructed by generating a family of
2-planes called views $\{v_t, t \ge 0\}$, where $t$ represents time. At
each time, the corresponding scatterplot is the projection of
the data onto $v_t$. As $t$ is increased in small increments, the
view $v_t$ is updated, and the new point coordinates recomputed
and replotted. In order that the scatterplot appear to move
smoothly the family of views needs to fulfill certain
requirements, as discussed in Asimov. The scheme which we
have successfully implemented operates by constructing a
sequence of views $\{V_i, i = 1, 2, \ldots\}$, by random sampling, for
example. Then interpolation is done between successive
views to obtain $\{v_t, t \ge 0\}$. Methods have been derived for


interpolation  between  successive  views  such  that  the  illusion 
of  continuous  motion  is  preserved.  In  order  that  moving  plots 
be  a  technique  adaptable  to  a  wide  range  of  data  analytic 
situations,  the  data  viewer  provides  a  number  of  schemes  for 
generating  this  sequence  of  views,  and  allows  consideration 
to  be  reduced  to  more  specialized  sets  of  views. 
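
The construction just described (project onto a view, then move between successive views in small steps) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the data viewer's method: a view is represented as an orthonormal pair of coefficient vectors, and the interpolation used here is plain linear blending followed by Gram-Schmidt re-orthonormalization, a simple stand-in for the interpolation methods referred to above. It breaks down if two successive views are nearly opposite.

```python
import math


def orthonormalize(a, b):
    """Gram-Schmidt: an orthonormal pair spanning the same plane as a, b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    u = [x / norm_a for x in a]
    dot = sum(x * y for x, y in zip(u, b))
    w = [y - dot * x for x, y in zip(u, b)]
    norm_w = math.sqrt(sum(x * x for x in w))
    return u, [x / norm_w for x in w]


def interpolate(view0, view1, s):
    """A view between view0 (s = 0) and view1 (s = 1)."""
    (a0, b0), (a1, b1) = view0, view1
    a = [(1 - s) * p + s * q for p, q in zip(a0, a1)]
    b = [(1 - s) * p + s * q for p, q in zip(b0, b1)]
    return orthonormalize(a, b)


def project(data, view):
    """Scatterplot coordinates of each observation in the given view."""
    u, v = view
    return [(sum(x * c for x, c in zip(row, u)),
             sum(x * c for x, c in zip(row, v))) for row in data]


def movie(data, views, steps):
    """Yield one scatterplot per small time increment."""
    for v0, v1 in zip(views, views[1:]):
        for k in range(steps):
            yield project(data, interpolate(v0, v1, k / steps))
    yield project(data, views[-1])
```

With `steps` chosen so that successive frames differ only slightly, showing the frames in order at about ten per second gives the impression of continuous motion.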

Specialized  Views 

The  most  general  kind  of  view  is  an  arbitrary  2-plane,  and 
the  corresponding  plot  is  the  linear  projection  of  the  p- 
dimensional  variable  space  on  to  this  plane.  This  means  that 
no  particular  orientation  of  the  plot  is  specified.  If  a  subset 
of  the  variables  is  of  particular  interest  to  the  user,  he  can 
indicate  this  by  classifying  these  variables  as  "active"  and  the 
remainder "inactive". Only active variables can have non-zero
projection coefficients, so this achieves a reduction in
dimensionality  of  the  space  being  explored. 

Specialized  views  arise  when  the  data  to  be  displayed 
consists  of  two  disjoint  groups  of  variables,  of  sizes  p  and  q 
respectively, described by the matrices $X_{n \times p}$ and $Y_{n \times q}$. An
example  of  this  is  the  canonical  correlation  situation.  Then 
the  views  of  interest  consist  of  linear  combinations  of  the  X 
variables  plotted  versus  linear  combinations  of  the  Y 
variables,  so  that  consideration  can  be  reduced  to  particular 
kinds  of  2-planes.  Notice  here  also  that  an  orientation  of  the 
plot  is  specified.  Views  of  this  type  are  obtained  when 
variables  are  classified  as  either  "X-variables",  "Y-variables" 
or  inactive,  so  that  a  variable  cannot  have  more  than  one  X 
or Y non-zero projection coefficient.

This setup has a number of important special cases. Variable
scatterplots arise when $p = q = 1$. Regression also fits into
this framework, where there is one response ($q = 1$) and $p$
predictor variables. The data-viewer user controls the type of
view generated by attaching classification labels to each
variable, "X", "Y" and "A" for X, Y and active variables;
otherwise the variable is rendered inactive.
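
The effect of the labels on a view can be illustrated as a mask on the projection coefficients. The Python sketch below rests on one reading of the text, which is an assumption: an "X" variable may contribute only to the horizontal axis, a "Y" variable only to the vertical axis, an "A" variable to either, and an unlabeled variable to neither. The function names are hypothetical.

```python
# For each label: (may appear on horizontal axis, may appear on vertical axis).
RULES = {"X": (True, False), "Y": (False, True), "A": (True, True), "": (False, False)}


def mask_view(view, labels):
    """Zero out every projection coefficient that the labels forbid."""
    horizontal, vertical = view
    allowed = [RULES[label] for label in labels]
    h = [c if ok[0] else 0.0 for c, ok in zip(horizontal, allowed)]
    v = [c if ok[1] else 0.0 for c, ok in zip(vertical, allowed)]
    return h, v


# Regression-style labelling: two predictors and one response.
labels = ["X", "X", "Y"]
print(mask_view(([0.6, 0.8, 0.3], [0.1, 0.2, 0.9]), labels))
```

The masked vectors would still need to be rescaled (and, when "A" variables are present, re-orthogonalized) before being used as a view; that step is left out here.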

Schemes  for  Generating  Views 

1.  Scanning 

A  brute  force  approach  to  examining  a  high  dimensional 
pointcloud  is  to  look  at  all  possible  projections  onto  planes. 
Schemes  for  scanning  high  dimensional  space  are  termed 
Grand  Tour  methods  (Asimov,  Buja).  One  such  method 
approximates  all  possible  data  views  by  sampling  planes 
uniformly  and  interpolating  between  them.  So,  with  some 
time  and  patience  a  user  may  come  arbitrarily  close  to  every 
2-d  data  projection.  We  have  used  this  scheme  successfully 
with  up  to  1000  data  points,  and  there  is  no  restriction  on  the 
number  of  variables.  In  practice,  human  vision  and 
understanding  limits  our  perception  to  3-d,  so  discerning 
higher  dimensional  structure  can  be  difficult. 
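
Sampling planes uniformly can be done in several ways; one standard construction, shown here as a Python sketch rather than as the scheme actually used in the data viewer, draws two vectors of independent standard normal coefficients and orthonormalizes them. Because the normal distribution is rotation invariant, the plane they span has no preferred direction. The function names and the seed argument are assumptions.

```python
import math
import random


def random_view(p, rng):
    """An orthonormal pair (u, v) spanning a random 2-plane in p dimensions."""
    if p < 2:
        raise ValueError("need at least two variables")
    a = [rng.gauss(0.0, 1.0) for _ in range(p)]
    b = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # Gram-Schmidt: normalize a, remove its component from b, normalize again.
    norm_a = math.sqrt(sum(x * x for x in a))
    u = [x / norm_a for x in a]
    dot = sum(x * y for x, y in zip(u, b))
    w = [y - dot * x for x, y in zip(u, b)]
    norm_w = math.sqrt(sum(x * x for x in w))
    v = [x / norm_w for x in w]
    return u, v


def sample_views(p, count, seed=0):
    """A reproducible sequence of random views to interpolate between."""
    rng = random.Random(seed)
    return [random_view(p, rng) for _ in range(count)]
```

Interpolating between the successive views returned by `sample_views` then gives a tour that, given enough time, passes close to any chosen projection.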

2.  Optimization 

We  use  projection  pursuit  techniques  to  augment  the 
scanning  procedure.  The  aim  is  to  avoid  views  that  look  like 
featureless "blobs", concentrating instead on projections
showing  structure.  Commonly,  context  knowledge  and 
information  gained  from  the  analysis  to  date  dictate  how  the 
analysis  should  proceed,  so  that  exploration  is  done  with 
particular questions in mind: for example, are there
relationships between groups of variables, or are outliers
present? It follows that the user wishes to see views that help
provide answers to these questions. The data viewer can cater
to  this  situation  by  displaying  a  movie  of  informative  views. 
The  user  selects  (or  defines)  a  measure  F  of  how  interesting 
a view is, then the sequence of views is chosen such that
$F(v_1) \le F(v_2) \le \cdots$. Gradient methods are used to obtain
successive views. Suggestions for suitable choices of $F$ are
given  in  Friedman  &  Tukey,  and  Huber.  This  scheme 
corresponds  to  an  interactive,  real-time  optimization, 
interactive  because  both  the  starting  plane  and  stepsize  are 
under  direct  user  control.  This  has  the  advantage  that  typical 
sources  of  difficulty  with  optimization  methods,  for  example, 
the presence of function "flat spots", are no longer a problem.
Also,  since  our  emphasis  is  on  exploration,  finding  views 
which  are  local  maxima  suffices. 
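
A rough Python sketch of such a climb follows. It is an illustration only: the index used here (total variance of the two plotted coordinates) is a simple stand-in and not one of the indices proposed by Friedman & Tukey or Huber, the gradient is approximated by finite differences, and the function names and the step-halving rule are assumptions.

```python
import math


def orthonormalize(a, b):
    """Gram-Schmidt: an orthonormal pair spanning the same plane as a, b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    u = [x / norm_a for x in a]
    dot = sum(x * y for x, y in zip(u, b))
    w = [y - dot * x for x, y in zip(u, b)]
    norm_w = math.sqrt(sum(x * x for x in w))
    return u, [x / norm_w for x in w]


def total_variance(data, view):
    """Stand-in index F: variance of the horizontal plus vertical coordinates."""
    total = 0.0
    for axis in view:
        coords = [sum(x * c for x, c in zip(row, axis)) for row in data]
        mean = sum(coords) / len(coords)
        total += sum((c - mean) ** 2 for c in coords) / len(coords)
    return total


def climb(index, data, view, stepsize, n_steps, h=1e-6):
    """Return views v1, v2, ... whose index values never decrease."""
    views = [view]
    current = index(data, view)
    p = len(view[0])
    for _ in range(n_steps):
        flat = list(view[0]) + list(view[1])
        grad = []
        for j in range(len(flat)):
            bumped = flat[:]
            bumped[j] += h
            grad.append((index(data, (bumped[:p], bumped[p:])) - current) / h)
        moved = [c + stepsize * g for c, g in zip(flat, grad)]
        candidate = orthonormalize(moved[:p], moved[p:])
        value = index(data, candidate)
        if value >= current:
            view, current = candidate, value
            views.append(view)
        else:
            stepsize /= 2  # overshoot: try a smaller step next time
    return views
```

In an interactive setting the starting view and `stepsize` would be set by the user rather than fixed in advance, which is what lets the analyst step off a flat spot by hand.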

3.  Other  Schemes 

Another  method  for  generating  a  view  sequence  permits 
exploration in a neighborhood of the current view. For
example, this enables us to discover how sensitive the
projected plot is to local changes in the optimizing view.
Provision is also made for user specified views, and variable
scatterplots are particularly easy to obtain.

To  summarize,  the  essential  capability  of  the  data  viewer  is 
the  moving  scatterplot.  The  user  has  a  measure  of  control 
over  the  motion  via  a  choice  of  algorithms  for  generating 
views,  and  specification  of  the  type  of  view  to  be  generated. 
Other  data  viewing  tools  can  be  considered  in  one  of  three 
categories.

(i) Tools particular to the data viewer:

For  example,  a  backtracking  capability,  which  provides 
the  ability  to  rewind  or  to  play  back  the  scatterplot 
movie,  is  useful.  We  would  also  like  the  ability  to  save 
away  views  for  retrieval  and  re-examination  at  a  later 
stage. 

(ii) General graphical capabilities:

This  includes  multiple  plots  for  easy  comparisons, 
interactive  painting  (or  brushing)  and  point 
identification  utilities. 

(iii) Data related tools:

This  includes  facilities  such  as  variable  scaling  and 
transformation,  and  subset  selection. 

Design  of  a  Data  Viewer 

As  outlined  above,  the  data  viewer  supports  a  high  degree  of 
functionality.  It  should  be  extendable  to  include  other 
statistical  methods,  for  example  new  view  generating 
schemes.  We  aim  towards  a  system  with  a  unified  graphical 
interface  to  simplify  user  communication.  These 
requirements  necessitate  a  careful  system  design.  The  data 
viewer  is  being  developed  on  a  Lisp  Machine,  which  has  the 
computational  power  necessary  for  the  demanding  task  of 
scatterplot  motion,  and  provides  an  environment  and 
language  features  that  encourage  experimentation 
(McDonald  &  Pedersen). 

Our  approach  to  the  design  of  the  data  viewer  is  to  factor  the 
system  into  various  components  and  sub-components.  Each 
component  has  its  own  specific  task  with  well-defined 
connections  or  interfaces  between  them.  This  design  can  aid 
both  system  programmer  and  user.  From  the  programmer’s 
point  of  view  it  is  highly  modular,  so  that  different 
components  can  be  implemented  independently  except  for 
the  specified  connections.  It  also  means  that  a  certain  amount 
of  flexibility  is  inherent  in  the  design;  components  can  be 
modified  internally,  for  example  a  new  view  generating 
scheme  added,  without  affecting  the  rest  of  the  system.  This 
design  helps  to  provide  a  coherent  user  model,  reducing  the 
initial  time  overhead  necessary  to  gain  familiarity  with  the 
system.  Also,  a  knowledge  of  the  various  components  and 
their interrelationships means that the capabilities of the
system  are  better  understood;  its  power  and  flexibility  should 
be  transparent,  not  obscure  to  the  user. 

Object  Oriented  Programming  and  the  Data  Viewer 

The  Symbolics  Lisp  Machine  provides  the  "Flavors" 
extension  to  Lisp  for  a  style  of  programming  called  "object 
oriented" (Weinreb and Moon). An object consists of data
and procedures bundled together. (In Flavors terminology
these  are  termed  respectively  the  "instance  variables"  and 
"methods"  of  the  object.)  One  way  to  think  of  it  is  as  a  record 
in  Pascal  or  structure  in  S,  with  its  own  internal  procedures. 
The  object  oriented  programming  style  is  by  nature  modular, 
because  it  allows  a  data  structure  and  the  procedures  that 
operate  on  it  to  be  considered  a  single  entity.  The  Flavors 
system  gives  us  a  natural  way  to  factor  the  data  viewer  into 
components:  an  object  is  created  corresponding  to  each 
component. 

We  can  regard  the  data  viewer  as  an  object  for  displaying 
data views. Its components or instance variables are:

- a dataset

- a "view-maker"; an object whose task it is to supply
views or projection planes.

- a plot; a window object which shows the moving point-cloud.

The  data  viewer  has  a  "draw-plot"  method,  which  carries  out 
its  primary  task.  Draw-plot  applies  the  projection  from 
view-maker  to  the  dataset  and  shows  the  result  in  the  plot 
window.  This  is  a  one  sentence  explanation  of  the  operation 
of  data  viewer.  A  further  level  of  detail  is  got  by  considering 
the  data  viewer  components  and  their  respective  tasks. 

The  "view-maker" 

The  view-maker  object  generates  new  views,  so  it  has  a 
"new-view"  method  for  this  task.  In  order  to  obtain  a  new 
view  the  following  information  is  needed: 

- the current view

- an algorithm (user specified) for generating views

- the (user specified) classification of each variable, i.e.
"X", "Y", "A" or " " (inactive)

- speed: the amount by which the view is incremented.

These items are represented as instance variables of the
view-maker. The new-view method updates the current view
by the amount "speed", according to the specified algorithm
and  variable  classifications. 

The  plot  window 

The  plot  has  two  major  components,  which  are: 

- a point-cloud: the projection of dataset onto the current
view

- variable-boxes: for each variable in dataset the
following information is displayed:

  - the variable name

  - the coefficient applied to this variable in the current
    view

  - the variable classification, i.e. "X", "Y", "A" or " ".

We  can  think  of  the  plot  window  as  being  a  mirror  of  the 
dataset  and  view-maker  objects.  This  we  refer  to  as  the 
"display  constraint".  The  point-cloud  reflects  the  dataset  and 
the  view  supplied  by  view-maker.  The  variable-boxes  display 
the  variable  names  from  the  dataset,  and  the  variable 
classifications  and  projection  coefficients  from  view-maker. 
This  mirror  analogy  for  the  plot  has  some  powerful 
consequences:  the  plot  always  shows  the  current  state  of  the 
underlying  objects.  So,  changes  to  the  dataset  or  view-maker 
are  immediately  reflected  on  the  screen.  Motion  results  as  a 
trivial  consequence  of  the  display  constraint;  changes  in  the 
view  supplied  by  view-maker  imply  a  corresponding  update 
of  the  displayed  point-cloud  and  projection  coefficients.  This 
means  that  except  for  the  display  constraint,  the  plot  window 
and  the  pair  of  objects  view-maker  and  dataset  operate 
independently.  Once  this  part  of  the  user  model  is  grasped, 
the  user  becomes  aware  of  some  interesting  data-analytic 
applications. 

Some  implications  of  this  design. 

A  typical  data  analysis  involves  the  comparison  of  two  or 
more  datasets,  for  example  males  and  females.  One  way  of 
doing  this  is  to  put  a  number  of  plots  each  in  its  own  window 
together  on  the  screen,  and  examine,  say,  height  plotted 
against  weight  for  each  set.  If  a  data  viewer  is  created 
corresponding  to  each  dataset,  and  these  viewers  have  a 
common  view-maker  object,  then  both  displays  will 
simultaneously reflect the same view-maker object. At a
particular time t, one display shows the males dataset
projected onto the corresponding view v_t, and the second
display shows the projection of the female dataset, also onto
v_t. When the view is modified by view-maker, both displays
update  themselves  to  reflect  this  change.  This  dynamic 
dataset  comparison  makes  it  possible  to  explore  questions 
such  as:  are  interesting  views  common  to  datasets?  We 
describe  these  data  viewers  as  being  linked  or  connected  by  a 
shared  view-maker  object. 
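The linking of two data viewers through one view-maker might look as follows in Python. This is a sketch under stated assumptions: the view is reduced to a single rotation angle, the list of dependents is one possible way to enforce the display constraint, and the two small datasets are invented.

```python
import math


class ViewMaker:
    """Holds the current view (a rotation angle here) and the viewers that mirror it."""

    def __init__(self, angle=0.0, speed=0.1):
        self.angle = angle
        self.speed = speed
        self.dependents = []

    def new_view(self):
        # Increment the view by "speed", then let every linked display refresh itself.
        self.angle += self.speed
        for viewer in self.dependents:
            viewer.draw_plot()


class DataViewer:
    def __init__(self, dataset, view_maker):
        self.dataset = dataset            # list of (x, y, z) cases
        self.view_maker = view_maker
        self.shown_angle = None
        self.cloud = []
        view_maker.dependents.append(self)
        self.draw_plot()

    def draw_plot(self):
        a = self.view_maker.angle
        # Rotate about the vertical axis: horizontal = x cos(a) + z sin(a), vertical = y.
        self.cloud = [(x * math.cos(a) + z * math.sin(a), y)
                      for x, y, z in self.dataset]
        self.shown_angle = a


# Hypothetical (height, weight, age) cases for the two groups.
males = [(170.0, 70.0, 30.0), (180.0, 80.0, 40.0)]
females = [(160.0, 55.0, 25.0), (165.0, 60.0, 35.0)]

shared = ViewMaker(angle=0.0, speed=0.1)
male_viewer = DataViewer(males, shared)
female_viewer = DataViewer(females, shared)
shared.new_view()   # one change of view updates both displays
```

Because both viewers hold a reference to the same view-maker object rather than a copy of its state, no extra synchronization code is needed.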

Similarly,  two  or  more  data  viewers  can  be  linked  via  a 
common  dataset  object,  each  with  its  own  view-maker. 
Suppose  the  dataset  object  contains  a  component  which 
defines  a  color  (or  any  drawing  symbol)  for  each  case.  The 
display  reflects  this  information  by  drawing  each  point  with 
the  appropriate  color.  Then  a  shared  dataset  object  means 
that  two  data  viewers  show  different  views  of  the  same  data, 
but  with  points  linked  by  color. 

For  further  discussion  on  the  applications  of  automatic 
updating  see  Stuetzle. 


A  scheme  for  user  interaction. 

How  can  user  interaction  be  explained  by  this  model? 
Suppose  a  user  wishes  to  identify  the  outlier  in  a  scatterplot. 
An obvious solution is to make use of the pointing device (a
mouse) to indicate the particular point, and have the case
identification label appear in response. We interpret this
action by realizing that we are communicating with the
underlying dataset "case" via its plotted representation, a
point  on  the  screen.  So,  the  role  of  the  display  is  extended 
from  being  simply  a  dumb  reflection,  to  provide  us  with 
communication  routes  to  the  constituent  objects. 
Communication  with  data  viewer  components  can  also 
include  modification.  For  example,  the  plot  window  has  a 
dial  or  gauge  which  shows  the  value  of  the  view-maker  speed 
component.  This  gauge  is  what  is  termed  "mouse  sensitive", 
so  that  it  can  respond  to  mouse  clicks  by  changing  its  value. 
In  fact,  what  changes  is  the  view-maker  speed  value,  and  the 
display  modifies  itself  to  remain  up  to  date.  A  similar  scheme 
can  be  used  for  changing  variable  classifications:  variable 
boxes  are  mouse  sensitive  and  provide  hooks  into  the 
variable  classification  quantities  which  are  part  of  the  view- 
maker.  Direct  manipulation  is  the  term  given  to  this  style  of 
interaction,  because  the  illusion  is  that  the  plotted 
representation  or  icon  is  actually  the  underlying  data  item, 
not  just  a  picture  of  it.  One  gets  the  feeling  that  the  plot 
window  is  somehow  alive,  and  has  knowledge,  so 
communication  can  be  more  expressive. 
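The speed gauge gives a compact illustration of this scheme. The Python sketch below is hypothetical: the class names, the step size and the convention that a left click speeds up while any other button slows down are assumptions, not details given in the paper.

```python
class ViewMaker:
    def __init__(self, speed=1.0):
        self.speed = speed


class SpeedGauge:
    """A mouse-sensitive gauge: it stores no speed of its own, only a hook to the view-maker."""

    def __init__(self, view_maker, step=0.5):
        self.view_maker = view_maker
        self.step = step

    def displayed_value(self):
        # The gauge mirrors the underlying object (the display constraint).
        return self.view_maker.speed

    def on_click(self, button):
        # Assumed convention: "left" increases the speed, anything else decreases it.
        delta = self.step if button == "left" else -self.step
        self.view_maker.speed = max(0.0, self.view_maker.speed + delta)


vm = ViewMaker(speed=1.0)
gauge = SpeedGauge(vm)
gauge.on_click("left")
gauge.on_click("left")
gauge.on_click("right")
```

The click changes the view-maker, not the gauge; the gauge then shows the new value simply because it always reads it from the underlying object.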

Conclusion 

A  moving  scatterplot  is  both  an  efficient  and  informative 
method  for  displaying  high-dimensional  pointclouds.  The 
data  viewer  system  enables  this  technique  to  be  used  for 
purposes  of  exploration.  A  variety  of  view  generating 
schemes,  and  the  capability  to  specialize  views  mean  the  user 
can  adapt  the  system  to  the  data  analytic  task  at  hand.  In 
designing  the  data  viewer  we  aim  towards  a  coherent  user 
model;  this  can  aid  system  development  and  extension,  and 
more  significantly,  its  utility.  The  notions  of  object  oriented 
programming  provide  the  basis  for  our  design.  The  user 
communicates  with  the  system  via  directly  manipulating 
plotted  items.  These  features  combine  to  make  the  data 
viewer  good  for  a  large  range  of  approaches  to  exploration. 
INTERACTIVE COLOR GRAPHIC DISPLAY OF DATA BY 3-D BIPLOTS

I. BIPLOTS. This paper discusses our experience in
displaying  data  as  3-D  biplots  by  means  of  two 
alternative  graphics  devices,  (1)  standard  high 
resolution  graphics  terminals  with  attendant 
printer  plots  or  pen  plotters,  (2)  the  Raster 
Technologies  Model  ONE/20  frame  buffer  color 
device  with  attendant  black-and-white  laser 
printer. 

Biplots  (Gabriel,  1981a)  are  particularly  effective 
for  exploring  multivariate  data  matrices  and  for 
diagnosing  models  that  fit  these  data  or  subsets  of 
them. A 3-D biplot displays a rank three approximation
to the data matrix (which is usually
centered) by means of row markers a[i], i = 1,...,n
(= the number of rows of the matrix) and column
markers b[j], j = 1,...,m (= the number of columns
of the matrix). The lower rank approximation may
be obtained by least squares or by resistant
methods (Gabriel and Odoroff, 1984a). The
markers are obtained by factorizing the approximation
as AB' and using the rows of A and B,
respectively.

In view of this construction, the fundamental
property of all biplots is inner product representation
of the data, i.e., the inner product of any a[i]
and any b[j] approximates y[i,j], the corresponding
element of the data matrix (Gabriel, 1981b).

Special  types  of  biplots  further  approximate  the 
variances  and  covariances  of  the  columns  by  the 
configuration of the column markers, and the
Mahalanobis-type  distances  between  the  rows  by 
the  biplot  distances  between  the  row  markers 
(Gabriel,  1971). 

A  particularly  useful  feature  of  biplots  is  that 
patterns  of  their  markers  may  be  used  to  diagnose 
the  type  of  model  that  would  fit  the  data,  e.g.,  if 
the a[i]'s were all on one line and the b[j]'s on
another line then the data would be fitted either
by an additive model or by a
one-degree-of-freedom-for-non-additivity model, depending on
whether the two lines were perpendicular to each
other or not. A limitation of biplots is that they are
available only for data in matrix form. However,
higher-way layouts may at times be usefully
biplotted by collapsing them into a two-way table
(Kester, 1979; Gabriel, 1981b; an example will be
given  below). 
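The inner product property and the line diagnostic can be checked on a small artificial table. The Python sketch below assumes an exactly additive table with invented row and column effects, and uses one convenient rank-2 factorization chosen by hand; a real biplot would obtain its markers from a least squares or resistant fit of the centered data.

```python
def inner(u, v):
    return sum(x * y for x, y in zip(u, v))


# Hypothetical row and column effects for an additive table y[i][j] = r[i] + c[j].
r = [1.0, 2.0, 4.0]
c = [0.5, -1.0, 3.0, 0.0]
y = [[ri + cj for cj in c] for ri in r]

# One exact factorization: row markers a[i] = (r[i], 1), column markers b[j] = (1, c[j]).
a = [(ri, 1.0) for ri in r]
b = [(1.0, cj) for cj in c]

# Inner product representation: a[i] . b[j] reproduces y[i][j].
max_error = max(abs(inner(a[i], b[j]) - y[i][j])
                for i in range(len(r)) for j in range(len(c)))

# All a[i] share the second coordinate 1, so they lie on a line with direction (1, 0);
# all b[j] share the first coordinate 1, so they lie on a line with direction (0, 1).
row_direction = (a[1][0] - a[0][0], a[1][1] - a[0][1])
col_direction = (b[1][0] - b[0][0], b[1][1] - b[0][1])
lines_perpendicular = inner(row_direction, col_direction) == 0.0
```

In this constructed case the two lines are perpendicular, which is the pattern the text associates with an additive model.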

Biplots  can  display  data  only  as  well  as  the  data 
can  be  approximated  with  lower  rank.  Planar  (rank 
2)  biplots  are  often  very  useful,  but  3-D  biplots 
often  do  better  (Gabriel,  1981b).  We  expect  to  do 
even  better  in  higher  dimensions  and  are  studying 
the  application  of  Banchoff's  ideas  (1986)  for  the 
displays  of  4-D  biplots,  on  which  we  hope  to  report 
at  a  later  date. 

Our  3-D  biplots  are  produced  by  a  number  of 
techniques,  each  of  which  displays  row  markers  and 
column  markers,  and  allows  these  to  be  labelled  to 
indicate the particular rows and columns represented.

II. DISPLAY DEVICES AND SYSTEMS. We have
routinely produced these on standard graphics
terminals by the BGRAPH system, due mostly to
Mike Tsianco (1981). BGRAPH produces static
displays of biplots by three dimensional perspective
projections of a viewing cube or dodecahedron,
and by using various depth cues. We have now
implemented the ANIMATE system, due principally
to Terry Therneau, which adds color and shape cues
and is semi-animated in that it simulates the effect
of depth by rapidly displaying successive views from
slightly different angles (Odoroff et al., 1986).
"Rocking" the picture back and forth through a
series of about 6 such views may create an impression
of three-dimensionality. (We distinguish this
from a truly animated system, such as PRIM or
MACSPIN, which is capable of calculating the
coordinates of any number of views and displaying
them in real time.)


TABLE 1: FEATURES OF THE BGRAPH AND ANIMATE SYSTEMS

                  BGRAPH                        ANIMATE

Markers           Points or vectors             Vectors, squares, circles
                                                or spheres with options
                                                of shading and color

Labels            Available                     Available

Depth             Perspective cues              Perspective cues on
                  on labels                     marker shapes
                  or
                  Stereograms (needs            Rocking through views
                  stereo glasses)               of small angular
                  or                            separation
                  Anaglyphs (need
                  color or polarized            Hiding of markers
                  display and glasses)          behind other markers

Rotations         Transformations to any        Transformation to any
                  viewpoint and angle           angle (not in real
                  (not in real time)            time)

Zoom and          Available                     Available
Window

Mainframe         UNIX VAX 11/750               UNIX VAX 11/750

Display           Graphics terminal             Raster Technologies ONE/20
device            (no local                     24-bit frame buffer device
                  computing)                    which can store several
                                                pictures simultaneously

                                                photographs of screen

Attributes  can  be  changed  and  displayed 
dynamically  via  keyboard  or  "mouse".  ANIMATE 
makes  use  of  certain  features  of  the  hardware,  such 
as  the  ability  to  rapidly  change  the  color  table,  and 
the  ability  to  rapidly  change  the  sector  of  the 
image  memory  displayed.  However,  it  is 
constrained  by  certain  features  of  the  hardware 
such as the necessity to update displays from the
remote host computer. We should like to ignore
the constraints of hardware, but they will remain
with us until the next generation of graphics
workstations is available. The newer graphics
workstations  will  not  require  some  of  the 
commands  presently  needed  for  ANIMATE.  They 
will  circumvent  many  of  our  present  hardware 
limitations. 

A  command  language  to  manipulate  the 
graphics  display  is  implemented  in  yacc,  lex,  and 
RATFOR on a VAX 11/750 with the UNIX operating
system.  The  displays  are  built  up  from  graphics 
primitives  provided  with  the  Raster  Technologies 
Model  ONE/20. 

The  basic  strategy  is  to  compute  the  display 
features  on  the  host  computer  using  the  principles 
outlined  in  Newman  and  Sproull  (1980).  A  set  of 
pictures  is  written  into  the  frame  buffer  memory, 
one  at  a  time  (about  ten  seconds  per  picture). 

When  the  host  has  completed  computing  all  the 
displays,  control  is  returned  to  the  frame  buffer 
and  the  pictures  are  displayed  in  rapid  succession  to 
simulate motion.

The  Raster  Technologies  Model  ONE/20  is  a 
frame buffer device which displays a picture of 512
by 512 pixels on a color video monitor. Each pixel is
addressed by a 24-bit word. The 24-bit word can be
partitioned flexibly to trade more pictures for
fewer  colors.  By  using  hardware  pan  and  zoom 
functions  additional  pictures  can  be  obtained  by 
sacrificing  picture  resolution.  By  these  artificial 
means,  ANIMATE  can  store  either  192  pictures  in 
four  colors  or  3  pictures  in  256  colors.  Colors  are 
chosen  from  a  palette  of  more  than  16  million. 
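The picture counts quoted above follow from simple bit arithmetic, sketched here in Python. The function name is invented, and the factor of 16 for reduced-resolution pictures is an inference from the quoted total (192 = 12 x 16), not a figure stated in the text.

```python
import math

BITS_PER_PIXEL = 24


def pictures_stored(colors, resolution_factor=1):
    """Pictures that fit when each picture needs log2(colors) bits per pixel.

    resolution_factor is the assumed number of reduced-resolution pictures
    that pan and zoom allow in place of one full 512-by-512 picture.
    """
    bits_needed = int(math.log2(colors))
    return (BITS_PER_PIXEL // bits_needed) * resolution_factor


palette_size = 2 ** BITS_PER_PIXEL                               # 16,777,216 colors
full_color_pictures = pictures_stored(256)                       # 24 // 8 = 3
four_color_full_res = pictures_stored(4)                         # 24 // 2 = 12
four_color_reduced = pictures_stored(4, resolution_factor=16)    # 12 * 16 = 192
```

The trade is therefore between bits per pixel (colors), number of pictures, and resolution, with the total memory fixed.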

The  frame  buffer  device  provides  the  ability  to 
use  a  large  palette  of  colors  in  displays  and  since  it 
is  a  raster  device,  motion  can  be  simulated  by 
animation.  The  host  computer  computes  the 
attributes  of  the  display  and  communicates  via  a 
9600  baud  communication  link  to  the  frame  buffer 
device.  As  noted,  up  to  192  pictures  may  be  written 
into  the  frame  buffer  memory  before  control  is 
handed  to  the  display  processor.  The  system  is  thus 
limited  in  the  rapidity  with  which  a  display  can  be 
changed  by  the  speed  of  the  host  computer  and  the 
communication line. Modifications of color, or of the choice
of the sector of the frame buffer displayed, are rapid;
modifications  requiring  the  host  to  compute  a  new 
picture  are  slow.  When  more  local  processing 
power  is  available  this  problem  will  be  solved. 

We  have  found  that  6-24  pictures  displayed  at 
ten  frames  per  second  in  the  frame  buffer  are 
adequate to simulate motion. We also have available
the use of perspective, eclipsing of points by
their  neighbors,  and  the  representation  of  points  as 
illuminated  spheres  to  aid  in  the  simulation  of 
depth. 

III. SOME EXAMPLES. We now discuss several sets
of examples, some of real data and some of artificially
generated matrices. Each example told us
something about the graphics systems and the usefulness
of their various features. At the Interface
Symposium we accompanied our paper with a
series of color slide photographs of displays on the
ONE/20 screen. We tried to imitate the rocking of
views on the device by rapidly flipping back and
forth through a set of slides in a projector carousel;
we hope this gave a sufficiently clear impression
of the capabilities of the ANIMATE system. We
cannot hope to achieve the same impression in this
black and white printed report, so we show only a
small number of laser printer reproductions of
some of the slide views. In view of these difficulties
of reproducing the semi-animated color display,
the discussion of the examples will be very brief
and the conclusions which we state will have to be
taken on trust until the reader has an opportunity
to see a demonstration of the ANIMATE system.

IIIa. THE IRIS DATA. For the well-known Anderson
(1935) Iris data, we have a 150-by-4 matrix, the four
columns being allocated to the variables of petal
length and width and sepal length and width, the
150 rows to the 150 Iris flowers, the first fifty rows
to I. setosa flowers, the next fifty to I. versicolor, the last
fifty to I. virginica.

The biplot of these data, after variable means
had been subtracted, is shown in Figure 1 with
vectors for the four variables' markers and individual
Iris markers being circles of different shadings
for I. setosa (the leftmost scatter), I. versicolor
(the middle scatter) and I. virginica (the rightmost
scatter). The configuration of the variables shows
petal width and length to be very highly correlated
with each other and highly with sepal length, but
sepal width is not well correlated with any of them;
the scatter of the flowers is seen to consist of clearly
distinct clusters for the three species; the separation
of the species' clusters is along the direction of
the first three variables, showing that I. setosa is
the smallest species on these measures, I. versicolor
is larger and I. virginica somewhat larger, though
the latter two are not as well separated from each
other as they are from I. setosa.

Rotation of the biplot to another view emphasizes
the species separation and shows I. setosa to
be less variable than the other species. It also
suggests that two outliers existed and points to a
surprising crescent-shaped distribution of I.
virginica.

WHAT  WE  HAVE  LEARNED  FROM  THIS 
EXAMPLE.  In  exploring  unlabelled  data,  we  looked 
for  shapes  of  distributions,  for  clusters  and  their 
separation,  and  for  outliers.  Graphical  exploration 
of  this  kind  was  helped  by  the  following  features: 

COLOR  was  very  useful,  as  was  SHAPE; 
(PERSPECTIVE  was  not  tried  on  this  example). 
ROTATION  was  crucial,  and  it  needed  to  be  fast  to 
help in exploratory analysis. ROCKING did not add
much to the feeling of depth; HIDING did a little.

It  seemed  more  important  to  be  able  to  move  a 
viewing  plane  through  space  than  to  get  a  feeling 
of  depth,  of  space.  WHY?  Are  the  tools 
inadequate? Or are our questions essentially
one- and two-dimensional?

Study  of  covariance  configuration  needed 
LABELS.  The  lack  of  clear  DEPTH  cues  hampered  it 
more  than  it  had  hampered  the  study  of  scatters. 
Does  this  indicate  that  higher  dimensional  space  is 
more  important  in  studying  configurations  than 
scatters?  Are  we  able  to  imagine  covariance 
configurations  spatially,  but  distributions  planarly? 
Or is the effectiveness of more dimensions merely a
result of the relative sizes of the collection of units
(150) and variables (4)? Can it be that we can
visualize a few objects in a higher dimensional space
than a larger set of objects?

IIIb. TWO ARTIFICIAL 20-BY-15 DATA MATRICES.
Figures 3 and 4 are biplots of two 20-by-15 matrices
of data, generated from certain models (Gabriel
and Odoroff, 1986a,b,c). Each has 20 column
markers (dark spheres), and 15 row markers (light
spheres).

No  pattern  is  evident  on  Figure  3,  but  Figure  4 
reveals  a  clear  pattern,  the  column  markers 
appearing to be on an oblique plane. Comparing
the two figures it is clear that it is the use of perspective
in  Figure  4  that  makes  its  pattern  apparent.  The 
pattern  that  generated  Figure  3  is  not  at  all 
apparent  because  that  biplot  was  displayed 
without  perspective. 

Suitable  rotations  would  show  both  biplots  to 
have  the  row  markers  on  one  plane  and  the  column 
markers  on  another.  In  Figure  3  the  two  planes  are 
perpendicular,  in  Figure  4  they  are  not.  It  has  been 
shown  (Chuang,  Gabriel  and  Therneau,  1984)  that 
this diagnoses models r[i] + v[j] + t[i]w[j] and
μ + s[i]v[j] + t[i]w[j], respectively.

WHAT  WE  HAVE  LEARNED  FROM  THESE 
EXAMPLES. It is possible to MODEL data by discerning
patterns of biplot markers. Display by
means  of  SHADED  SPHERES  is  effective  when 
PERSPECTIVE  cues  are  used.  This  was  a  major  aid  in 
discerning  patterns.  Easy  and  rapid  ROTATION  and 
ROCKING  are  also  of  help. 

IIIc. AN EXPERIMENT ON SOLAR WATER HEATING
SYSTEMS. The last example uses the results of a
four factor experiment (Close, 1967, quoted in Box,
Hunter and Hunter, 1978). The data appeared
naturally in a four-way layout: high-low levels of I
(insolation), high-low levels of S (size of tank), of W
(water flow) and of D (discontinuity, or intermittency
of sunshine). To biplot it, we arranged it in
matrix form with factors I and S cross-classified in
the four rows, and factors W and D cross-classified
in the columns. Before biplotting, the data were
centered on the overall mean.

The biplot of these data (Figure 5) has four
row markers (dark spheres), labelled is, iz, js, and jz,
for all combinations of the upper and lower levels i
and j of factor I and upper and lower levels s and z
of factor S. Also, it has four column markers (light
spheres), labelled wd, wt, vd, and vt, for all combinations
of the upper and lower levels w and v of
factor W and upper and lower levels d and t of
factor D. Inspection of this biplot, and of suitable
rotations, allowed diagnosis of all the main effects
and of the interactions between I and S (no interaction)
and between W and D (which did interact).
(For a detailed discussion of the logic of these
diagnostics see Kester, 1979 and Chuang, Gabriel
and Therneau, 1984, as well as Gabriel and
Odoroff, 1986a,b,c.)
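The collapsing of the four-way layout into a two-way table, with the row and column labels used for the biplot, can be sketched in Python as follows. The response values are placeholders (simply 0 to 15), not the solar heating measurements, and the dictionary layout is an assumption made for illustration.

```python
from itertools import product

levels_I, levels_S = ("i", "j"), ("s", "z")
levels_W, levels_D = ("w", "v"), ("d", "t")

# Placeholder responses for the 16 runs, keyed by (I, S, W, D) level.
runs = {
    key: float(n)
    for n, key in enumerate(product(levels_I, levels_S, levels_W, levels_D))
}

# I and S are cross-classified in the rows, W and D in the columns.
rows = [i + s for i, s in product(levels_I, levels_S)]   # is, iz, js, jz
cols = [w + d for w, d in product(levels_W, levels_D)]   # wd, wt, vd, vt

# Center on the overall mean before biplotting.
grand_mean = sum(runs.values()) / len(runs)
matrix = [
    [runs[(r[0], r[1], c[0], c[1])] - grand_mean for c in cols]
    for r in rows
]
```

The resulting 4-by-4 matrix is what would be approximated in rank three and factorized into row and column markers.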

To  inspect  the  IW,  ID,  SW,  SD  interactions  we 
required  markers  for  each  factor  averaged  over  the 
levels  of  the  other  factor  of  the  pair.  Construction 
of  these  averages  on  the  biplot  was  therefore  an 
important diagnostic tool. The resulting biplot
(Figure 6) has the markers for the averages displayed.
Thus, for example, the average for d, the
upper level of factor D, is displayed by label *d
which is located midway between markers wd and
vd. Similarly, halfway between vd and vt, one
should find the v* label which marks the lower
level of factor W, but the v* marker is partly
obscured by the sphere marker for is.

We  now  look  at  the  (*d,*t)  and  the  (w*,v*) 
directions,  relative  to  the  (i*,j*)  and  (*s,*z) 
directions.  The  (*d,*t)  direction  is  found  to  be 
perpendicular  to  the  IS-plane,  and  this  allows  us  to 
conclude  that  D  does  not  interact  with  I  or  S.  On 
the  other  hand,  upon  rotating  the  biplot  about  the 
X-axis  we  find  the  (w*,v*)  direction  to  be  oblique  to 
the  IS-plane  and  so  conclude  that  W  does  interact 
with  I  and/or  S. 
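The construction of average markers and the perpendicularity check can be illustrated in Python. The marker coordinates below are invented so that the geometry is easy to verify by hand; they are not the coordinates of Figure 6, and the helper names are assumptions.

```python
def midpoint(p, q):
    return tuple((a + b) / 2 for a, b in zip(p, q))


def direction(p, q):
    return tuple(b - a for a, b in zip(p, q))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perpendicular_to_plane(d, u, v, tol=1e-9):
    """True if d is perpendicular to both directions u and v that span the plane."""
    return abs(dot(d, u)) <= tol and abs(dot(d, v)) <= tol


# Hypothetical 3-D markers: row markers in the plane z = 0, column markers in y = 0.
rows = {"is": (1.0, 1.0, 0.0), "iz": (1.0, -1.0, 0.0),
        "js": (-1.0, 1.0, 0.0), "jz": (-1.0, -1.0, 0.0)}
cols = {"wd": (1.0, 0.0, 1.0), "wt": (1.0, 0.0, -1.0),
        "vd": (-1.0, 0.0, 1.0), "vt": (-1.0, 0.0, -1.0)}

# Average markers: each level of one factor averaged over the levels of its partner.
avg = {
    "i*": midpoint(rows["is"], rows["iz"]), "j*": midpoint(rows["js"], rows["jz"]),
    "*s": midpoint(rows["is"], rows["js"]), "*z": midpoint(rows["iz"], rows["jz"]),
    "*d": midpoint(cols["wd"], cols["vd"]), "*t": midpoint(cols["wt"], cols["vt"]),
    "w*": midpoint(cols["wd"], cols["wt"]), "v*": midpoint(cols["vd"], cols["vt"]),
}

# The IS-plane is spanned by the (i*, j*) and (*s, *z) directions.
i_dir = direction(avg["i*"], avg["j*"])
s_dir = direction(avg["*s"], avg["*z"])

d_perpendicular = perpendicular_to_plane(direction(avg["*d"], avg["*t"]), i_dir, s_dir)
w_perpendicular = perpendicular_to_plane(direction(avg["w*"], avg["v*"]), i_dir, s_dir)
```

With these made-up coordinates the D direction is perpendicular to the IS-plane and the W direction is not, mirroring the pattern reported in the text.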

WHAT WE HAVE LEARNED: We can not only
model, but also observe effects, additivity and
interaction in higher way layouts. To do so we
obviously need LABELS and ROTATION. But a sense
of DEPTH has also been needed. We found it most
strongly suggested by the PERSPECTIVE cues on the
SHADED SPHERES; ROCKING was less useful
because the labels tend to jump about and make
identification of individual markers very difficult.

Another  most  important  help  was  the  use  of 
AVERAGE MARKERS. This allowed us to disentangle
the effects of single factors which had
originally  been  cross-classified  with  other  factors. 
More  generally,  we  think  an  important  diagnostic 
aid  to  a  graphics  system  is  the  ability  to  FIT  MODELS 
ON  THE  PLOT  itself.  For  example,  if  a  set  of  markers 
is  thought  to  lie  on  a  plane,  or  line,  it  is  extremely 
useful  to  be  able  to  fit  and  display  such  a  plane,  or 
line,  on  the  plot  itself.  Our  system  cannot  do  that 
yet. 

IV.  SUMMARY.  We  can  MODEL  by  using  biplot 
graphics,  and  we  can  also  EXPLORE  data. 

The  requirements  for  effective  graphics  are  first 
of  all  DISPLAYS  THAT  ATTRACT  THE  EYE  !!  This  is 
furthered  by  the  use  of  COLOR,  SHADING,  and 
SHAPES. 

DEPTH  is  also  important  for  inspection  of 
biplots,  and  it  is  most  effectively  simulated  by 
PERSPECTIVE  rather  than  by  MOTION.  That,  at  any 
rate,  is  the  impression  we  got  from  using  the 
BGRAPH  and  ANIMATE  systems. 

ANCILLARIES  that  we  consider  essential  for  any 
graphics  system  that  is  to  be  used  for  data  analysis 
are: LABELS !!, ROTATION ! and CAPABILITY TO
MODEL ON THE DISPLAY!
ABSTRACT

We  have  developed  a  Bayesian  approach  to  the  problem 
of  deciding  which  subset  of  a  proposed  set  of  p  predictor 
variables  to  include  in  a  linear  regression  model  that  is  to  be 
used  for  prediction.  The  direct  implementation  of  this 
method  requires  the  computation  of  the  usual  regression 
statistics for each of the $2^p$ possible submodels. We have
developed a branch and bound method which yields the
same results much more quickly by eliminating from consideration
those submodels which are destined to have negligible
posterior probability. Implementation of the algorithm
on  the  Cray  X-MP  supercomputer  is  discussed. 

1.  Introduction 

Our setting is that of standard linear regression. There
are n observations on p predictors $x_1, x_2, \ldots, x_p$ and one
dependent variable y. We shall assume the first order
model:

$$y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \epsilon,$$

where $\epsilon$ is normally distributed with mean 0 and variance
$\sigma^2$, and all cases (runs) are independent.

At  some  point  during  the  statistical  analysis,  one  may  be 
interested  in  the  possibility  of  omitting  some  predictors 
from  the  model.  The  search  for  a  "best"  submodel  (or  set  of 
submodels)  is  called  variable  selection  or  subset  selection.  It 
is undertaken for a number of possible reasons: (1) to
express the relationship between y and the predictors as
simply as possible, (2) to reduce future cost of prediction,
(3) to identify "important" and "negligible" predictors, or
(4) to increase the precision of statistical estimates and
predictions.

We  have  developed  a  Bayesian  approach  to  the  problem 
of  variable  selection,  the  details  of  which  will  be  submitted 
for publication elsewhere. In the next section, we summarize
this approach.

2.  The  Bayesian  Model 

We assume that the predictors have been "suitably
scaled," and shall avoid discussing this aspect any further
here. In addition, we shall let the $\beta_j$'s, $j = 1, 2, \ldots, p$, be
independently and identically distributed a priori, where
the form of the common marginal distributions is similar to
that of Box and Meyer [1986], i.e.,

$$P(\beta_j = 0) = h_2,$$
$$P(\beta_j < b,\ \beta_j \neq 0) = h_1 (b + f), \quad -f < b < f,$$
$$P(|\beta_j| > f) = 0,$$

where $h_1 > 0$, $h_2 > 0$, and $2 h_1 f + h_2 = 1$. This is a
"spike and slab" distribution, i.e., a mixture of a uniform
distribution over the interval $(-f, f)$ and a distribution
with all its probability mass at 0. We shall take $f$ and $\gamma$ as
the parameters of this distribution, where

$$\gamma = h_2 / h_1,$$

i.e., $\gamma$ is the height of the spike divided by the height of the
slab. We consider $\gamma$ to be a measure of one's prior inclination
to omit any predictor from the model. We shall treat it
as an adjustable parameter of the Bayes model, that is, we
shall not assign a distribution to it. However, we shall
assign to $\sigma$ the standard "noninformative" prior, i.e., $\ln\sigma$ is
locally uniform, independent of the $\beta$'s. We shall also take
$f$ to be arbitrarily large.
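As a short worked step (a sketch that assumes the normalization reads $2 h_1 f + h_2 = 1$ and that $\gamma$ is the ratio of spike height to slab height, $h_2 / h_1$, as the text describes), the two heights can be written in terms of the parameters $f$ and $\gamma$:

```latex
\begin{align*}
h_2 &= \gamma h_1, \\
2 h_1 f + \gamma h_1 = 1
  \;&\Longrightarrow\;
  h_1 = \frac{1}{2f + \gamma}, \qquad
  h_2 = P(\beta_j = 0) = \frac{\gamma}{2f + \gamma}.
\end{align*}
```

Under these assumptions both heights shrink as $f$ grows while their ratio stays fixed at $\gamma$, which may be why $\gamma$ rather than $h_2$ serves as the adjustable parameter when $f$ is taken arbitrarily large.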

We  are  interested  here  in  the  posterior  probabilities  of 
the  submodels,  where  each  submodel  is  identified  by  the 
requirement  that  each  member  of  a  particular  subset  of  the 
$\beta$'s is 0.

It can be shown that the posterior probability of the $m$th
submodel is

$$P_m = w_m \Big/ \sum_{m'} w_{m'},$$

where the logarithm of the "weight" $w_m$ is given by

$$\ln(w_m) = k_m \left(-\ln\gamma + \tfrac{1}{2}\ln\pi\right) + \ln\Gamma\!\left(\frac{n - k_m}{2}\right) - \tfrac{1}{2}\ln|V_m| - \tfrac{1}{2}(n - k_m)\ln S_m^2,$$

where $k_m$ is the number of terms in submodel $m$, $V_m$ is $\sigma^{-2}$
times the variance-covariance matrix of the least squares
estimates of the $\beta$'s omitted by submodel $m$, and $S_m^2$ is the
residual sum of squares for submodel $m$.
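A small Python sketch shows how such log weights might be evaluated and turned into posterior probabilities. It assumes that the log weight has the form $k_m(-\ln\gamma + \tfrac12\ln\pi) + \ln\Gamma((n-k_m)/2) - \tfrac12\ln|V_m| - \tfrac12(n-k_m)\ln S_m^2$ and that the posterior probability of a submodel is its weight divided by the sum of all weights; the function names and the three example log weights are invented for illustration.

```python
import math


def log_weight(k_m, n, gamma, log_det_v, s2):
    """ln(w_m) under the form assumed in the lead-in; log_det_v is ln|V_m|."""
    return (k_m * (-math.log(gamma) + 0.5 * math.log(math.pi))
            + math.lgamma((n - k_m) / 2)
            - 0.5 * log_det_v
            - 0.5 * (n - k_m) * math.log(s2))


def posterior_probabilities(log_w):
    """Normalize weights given on the log scale: P_m = w_m / (sum of all weights).

    Subtracting the largest log weight first avoids underflow, since
    math.exp(-1200.0) would otherwise evaluate to 0.0.
    """
    top = max(log_w.values())
    scaled = {m: math.exp(v - top) for m, v in log_w.items()}
    total = sum(scaled.values())
    return {m: s / total for m, s in scaled.items()}


# Hypothetical ln(w_m) values for three submodels.
log_w = {"x1": -1200.0, "x1,x2": -1198.0, "x1,x2,x3": -1205.0}
post = posterior_probabilities(log_w)
best = max(post, key=post.get)
```

Working on the log scale throughout is a design choice of the sketch; only differences of log weights matter for the posterior probabilities.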

From  this  posterior  distribution,  one  can  compute  and 
plot various quantities of interest as functions of $\gamma$, e.g., the
posterior probability that each $\beta_j$ is 0. A useful way of
assessing $\gamma$ is to plot the posterior probability of goodness of
fit as a function of $\gamma$: this is the sum of the posterior probabilities
of all submodels that pass a standard F-test for
goodness  of  fit  at  a  specified  level  of  significance. 

3.  Computations 

In principle, this kind of analysis requires the computation
of $S_m^2$ and $|V_m|$ for all $2^p$ submodels. Although there
are efficient methods for doing all possible regressions [Furnival
and Wilson, 1974], we really need only those submodels
that have non-negligible posterior probability.

We define a "negligible" posterior probability as follows.
Let $m^*$ refer to the best submodel, i.e., the one with maximum
posterior probability. If $P_m / P_{m^*} < 10^{-4}$, say, then
$P_m$ is negligible. Even though one can conceive of situations
in which assigning posterior probability of 0 to all submodels
classified as negligible by this definition will not
result in a good approximation to the true posterior distribution,
this is what we shall do. Our rationale is that such
submodels would not have practical interest, since a much
more acceptable alternative submodel exists.

We have developed a branch and bound algorithm that
finds the weights for all non-negligible submodels. First, a
forward selection routine is used to find a reasonably good
submodel $m_0$ that we can use as a standard. (We would
use m* if we could, but we don't know it until the algorithm
is finished.) We then define a cutoff value c for
ln($w_m$), where

$$c = \ln(w_{m_0}) - \ln(10^4).$$

The branch and bound algorithm finds all m and $w_m$ such
that ln($w_m$) exceeds c. This catches all the non-negligible
submodels (by the definition above) plus a few others,
which can then be weeded out.
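The search over a tree of nodes described in this section can be sketched as follows. This is a minimal illustration and not the authors' Fortran program: the names `branch_and_bound`, `log_weight` and `upper_bound` are hypothetical, the first free predictor stands in for the heuristic choice of splitting predictor, and the toy additive log-weights replace the actual posterior weights, whose bounds are not reproduced here.

```python
import math
from itertools import combinations


def branch_and_bound(n_pred, log_weight, upper_bound, cutoff):
    """Return {model: log-weight} for every model whose log-weight exceeds cutoff.

    A node is a pair (included, excluded) of disjoint predictor sets; it
    stands for all submodels that contain `included` and avoid `excluded`.
    """
    found = {}
    stack = [(frozenset(), frozenset())]  # root node: all submodels
    while stack:
        inc, exc = stack.pop()
        if upper_bound(inc, exc) <= cutoff:
            continue  # "bad" node: no member can exceed the cutoff
        free = [j for j in range(n_pred) if j not in inc and j not in exc]
        if not free:  # node holds a single submodel
            lw = log_weight(inc)
            if lw > cutoff:
                found[inc] = lw
            continue
        j = free[0]  # placeholder for the heuristic choice of predictor
        stack.append((inc | {j}, exc))  # daughter with predictor j present
        stack.append((inc, exc | {j}))  # daughter with predictor j absent
    return found


# Hypothetical toy problem: additive log-weights, one gain per predictor.
GAINS = [5.0, 3.0, -2.0, -6.0, 0.5, -12.0]


def toy_log_weight(model):
    return sum(GAINS[j] for j in model)


def toy_upper_bound(inc, exc):
    # Best case for the node: add every undecided predictor with positive gain.
    best_free = sum(max(g, 0.0) for j, g in enumerate(GAINS)
                    if j not in inc and j not in exc)
    return toy_log_weight(inc) + best_free


def brute_force(cutoff):
    # Enumerate all 2^r submodels, for checking the pruned search.
    n = len(GAINS)
    return {frozenset(c) for r in range(n + 1)
            for c in combinations(range(n), r) if toy_log_weight(c) > cutoff}


# Stand-in for the forward-selection "standard" submodel.
standard = frozenset(j for j, g in enumerate(GAINS) if g > 0)
cutoff = toy_log_weight(standard) - math.log(1.0e4)
kept = branch_and_bound(len(GAINS), toy_log_weight, toy_upper_bound, cutoff)
```

In this toy case the pruned search returns the same set of submodels as `brute_force(cutoff)` while skipping every node whose bound cannot reach the cutoff.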

The algorithm is based on a tree of nodes, where each
node is a collection of submodels. The root node consists of
all submodels. All other nodes are characterized by a set of
predictors that are in all submodels in that node and
another set of predictors that are excluded from all submodels
in that node. By computing the properties $S_m^2$ and
$|V_m|$ of the largest and smallest submodels in a node, one
can find upper and lower bounds on the posterior probabilities
of all submodels in the node, as functions of $k_m$. If a
node is found to be "bad," in the sense that none of its
members have ln($w_m$) that can exceed c, the node is not
considered further. Otherwise, the node is "split" into two
daughter nodes, where assignment of a submodel to one or
the other of the two daughters is made on the basis of the
presence or absence of a specified predictor. A heuristic
choice of predictor is made for this purpose, the idea being
to choose an apparently important one.

with many (> 50, say) predictors. At present, the largest
cases we have tried have 25 predictors; this takes roughly
one second of CPU time on the Cray. Future timing studies
will need to take account of the fact that the computing
time depends on the value of y, since for some values of y
the number of non-negligible submodels is considerable.
We have also begun to modify our Fortran program to
explicitly utilize the multitasking capability of the Cray
X-MP, which has two processors.

REFERENCES

SIMULATION OF RADAR AND SURFACE MEASUREMENTS OF RAINFALL

Introduction

The radar remote measurement of rain intensity
is a problem of continuing interest to radar
meteorologists. For example, the remote monitoring
of flash-flood producing storms is an
important applications area. Even though
considerable progress has been made in the
development of new radar measurement techniques,
e.g., using dual-polarized, dual-frequency or
differential phase shift, the problems associated
with the error structure of remotely sensed
precipitation estimates appear to have received
little attention so far in precipitation research,
AGU (1984).

A pulsed, meteorological radar illuminates a
radar resolution volume which depends on range,
antenna pattern and pulse width; typically the
volume is approximately 0.1 km³ for a range of 50 km,
1° beam width and 1 μsec pulse, Doviak and Zrnic
(1984). Within this volume, hydrometeors are
assumed to be randomly positioned, and constitute a
random medium from which radar measurements are
obtained. Furthermore, the fractional volume
concentration of the scatterers is generally very
small (<< 1%) so that the independent backscatter
approximation is valid. Hence, the average
backscattered power (or reflectivity) is
proportional to the (incoherent) sum of powers
backscattered by each particle within the
resolution volume. Statistical fluctuations of
the received power are related to the Doppler
velocity spectrum of the particles within the
resolution volume. Conventional Doppler radars
measure the mean power (or reflectivity, Z) as well
as the first moment of the Doppler spectrum, i.e.,
the mean velocity. Rainfall rate (R), or the vertical
flux of raindrops contained within the resolution
volume, is conventionally related to Z by power law
equations of the form $Z = aR^b$, where a and b depend on
the unknown raindrop size distribution (RSD),
Ulbrich (1983). The radar measured mean power ($\bar{P}$),
an electromagnetic signal, is related to Z (a
quantity of meteorological significance) by
$\bar{P} = CZ/r^2$, where C is the radar constant and r is the
range to the resolution volume. The parameters a
and b of the Z-R relation are generally estimated by
comparing radar measurements of reflectivity (Z)
with surface instruments such as raingages (which
estimate rain intensity) or raindrop size measuring
devices (or disdrometers) which estimate the RSD.
These surface devices have extremely small sampling
volumes (0.1 to 1 m³) compared to the radar sampling
volume. Thus, even under ideal conditions, it is
difficult to separate statistical fluctuations
from fluctuations caused by physical processes
(e.g., the changing RSD) with respect to the
relationship between radar Z and surface measured
R.

In order to overcome the problems associated
with acquiring simultaneous radar/surface rainfall
data, radar meteorologists often measure the RSD
(either at the surface, or in-situ, using
instrumented aircraft) or approximate the RSD by
some functional form, and calculate both
reflectivity ($Z_{RSD}$) and rainfall rate from the size
spectrum. We use the subscript RSD on Z to denote
that it is calculated based on a measured RSD or on
an assumed form for the RSD. Hence, it is possible
to relate $Z_{RSD}$ and rain intensity, and to translate
these relationships to radar measured Z versus rain
intensity. However, statistical correlations
between $Z_{RSD}$ and R must be understood before such
deductions can be made. If the RSD is approximated
by a gamma distribution, the three parameters of the
gamma distribution, namely particle number
density, shape parameter and scale parameter, are
generally unknown but are often estimated using
measured RSDs via moment methods or MLEs, Ulbrich
(1983), Mielke (1976), Wong and Chidambaram (1985).
Again, statistical correlations between the
various moments of the RSD must be understood before
deductions regarding the physical fluctuation of
the gamma parameters can be made.

Our  paper  is  organized  as  follows:  Section  1 
describes  the  RSD  model  and  intercomparison  between 
radar  reflectivity  and  rainfall  intensity. 
Section  2  considers  the  theoretical  correlation 
between $Z_{RSD}$ and rain intensity (as well as other
RSD  moments)  for  a  gamma  RSD.  Section  3  considers 
the  statistical  fluctuations  inherent  in  the  radar 
measurements  of  Z.  Simulation  methods  for  radar  Z, 
and  gamma  RSDs  (including  moments)  are  given  in 
Section  4  along  with  a  description  of  computational 
complexity  and  the  use  of  the  CSU/Cyber  205 
supercomputer.  Section  5  describes  the  results  of 
our  simulations  and  shows  examples  of  a  few  possibly 
incorrect  deductions  about  physical  processes  that 
have  been  made  which  can  be  accounted  for  by 
statistical  fluctuations  only. 

1. Raindrop Size Distribution (RSD).

The  space-time  evolution  of  the  raindrop  size 
distribution  (RSD)  is  typically  due  to  a  variety  of 
physical processes, e.g., evaporation, collision-coalescence,
collisional breakup, drop sorting,
etc.  Both  cloud  models  and  measurements  of  RSDs  at 
the  surface  show  that  a  gamma  RSD  can  account  for 
many  of  the  natural  variations  in  the  RSD,  Ulbrich 
(1983): 


$$N(D) = N_0 D^m \exp(-\Lambda D) \qquad (1)$$


where  N(D)  is  the  number  of  raindrops  per  unit 
volume per unit size interval (D to D + ΔD). In
terms  of  the  conventional  gamma  pdf,  N(D)  can  be 
written  in  the  equivalent  form, 


$$N(D) = \frac{N_T}{\Gamma(\alpha)\,\beta^{\alpha}}\, D^{\alpha-1} \exp(-D/\beta) \qquad (2)$$

where $\alpha > 0$, $\beta > 0$, $D \ge 0$. We note that
$N_0 = N_T/[\Gamma(\alpha)\beta^{\alpha}]$; $m = \alpha - 1$; $\Lambda = 1/\beta$.
A physically meaningful parameter
known as the median volume diameter $D_0$ can be
defined by

$$\int_0^{D_0} D^3 N(D)\,dD = \frac{1}{2}\int_0^{\infty} D^3 N(D)\,dD \qquad (3)$$

where $D_0$ is such that all drops with diameter < $D_0$
contribute one half of the total liquid water
content. Ulbrich (1983) has shown that
$\Lambda D_0 \approx 3.67 + m$. Reflectivity ($Z_{RSD}$) and rain intensity (R)


can be formulated as various moments of N(D):

$$Z_{RSD} = \int_0^{\infty} D^6 N(D)\,dD \quad \text{mm}^6\,\text{m}^{-3} \qquad (4)$$

$$R = \frac{\pi}{6}\int_0^{\infty} D^3\, v(D)\, N(D)\,dD \quad \text{mm hr}^{-1} \qquad (5)$$

where $v(D) = C D^{0.67}$ is the raindrop still-air fall
speed.
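For the gamma form of Eq. (1), any moment of N(D) has a closed form, which makes it easy to check numbers such as the reflectivity of a given RSD. The sketch below assumes the relation $\Lambda D_0 = 3.67 + m$ quoted above; the function names and the midpoint-rule check are illustrative and not taken from the paper.

```python
import math


def slope_from_d0(d0_mm, m):
    """Slope parameter Lambda (mm^-1), assuming Lambda * D0 = 3.67 + m."""
    return (3.67 + m) / d0_mm


def gamma_rsd_moment(k, n0, d0_mm, m):
    """k-th moment of N(D) = N0 D^m exp(-Lambda D):
    N0 * Gamma(m + k + 1) / Lambda^(m + k + 1)."""
    lam = slope_from_d0(d0_mm, m)
    return n0 * math.exp(math.lgamma(m + k + 1.0) - (m + k + 1.0) * math.log(lam))


def numeric_moment(k, n0, d0_mm, m, d_max=60.0, steps=60000):
    """Midpoint-rule integral of D^k N(D) over (0, d_max), as a cross-check."""
    lam = slope_from_d0(d0_mm, m)
    h = d_max / steps
    total = 0.0
    for i in range(steps):
        d = (i + 0.5) * h
        total += d ** (k + m) * math.exp(-lam * d)
    return n0 * total * h


# Sixth moment (reflectivity) for N0 = 8000 mm^-1 m^-3, D0 = 2 mm, m = 0.
z_rsd = gamma_rsd_moment(6.0, 8000.0, 2.0, 0.0)  # mm^6 m^-3
dbz = 10.0 * math.log10(z_rsd)
```

With k = 3.67 the same function gives the moment to which the rain intensity R is proportional; the constant of proportionality depends on the fall-speed coefficient C and on unit conversions, which are not fixed here.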

RSDs can be estimated by using surface
instruments, e.g., disdrometers (drop size
meters), or using probes mounted on instrumented
aircraft. The Joss-Waldvogel (1967) disdrometer
is a momentum device with a sensor area of 50 cm²,
and estimates N(D) for D in the range 0.5 mm to 5 mm
with typical integration times of 30 sec to 1 min. A
number of authors have used N(D) data from
Joss-Waldvogel disdrometers, Ulbrich (1983), Joss and
Gori (1978), Bringi et al. (1982), Goddard et al.
(1982).

To  define  the  radar  measured  reflectivity  factor 
we  assume  that  the  radar  resolution  volume  is  filled 
homogeneously  with  raindrops.  In  the  Rayleigh 
scattering  limit  the  radar  reflectivity  factor  Z  is 
then defined as

$$Z = \frac{1}{\Delta V}\sum_i D_i^6 \quad \text{mm}^6\,\text{m}^{-3} \qquad (6)$$

where the summation applies to raindrops within a
volume ΔV. Since the range of Z can be quite large
we define dBZ = 10 log10(Z). There is a general
correlation between Z and rain intensity expressed
by a power law of the form $Z = aR^b$.

Zawadzki (1984) has intercompared radar Z
measurements with surface disdrometer measurements
of rain intensity. He has analyzed a number of
factors which can cause discrepancies between
radar-derived rain rates and surface-measured rain
rates. Sampling errors affect both the surface
measurements (due to inadequate sample volume) as
well as the radar measurements of mean power (or
dBZ) due to the finite Doppler velocity spectrum.
Systematic errors can be caused by radar
calibration problems, as well as by the
non-coincidence (in space-time) of the raindrops
measured by the radar as opposed to the raindrops
which actually impact on the surface disdrometer.
Fig. 1a, taken from Zawadzki (1984), shows a scatter
plot of rain intensity versus $Z_{RSD}$ (estimated from a
Joss-Waldvogel disdrometer using Eq. (6)) while
Fig. 1b shows rain intensity versus radar-measured
Z. Zawadzki notes that the variability in R for a
given $Z_{RSD}$ is significantly less (by about a factor of 3)
than the variability in R for the same radar-measured
Z. By assuming that the discrepancies
between Figs. 1a, b were due to physical causes,
Zawadzki noted that, ". . . we must conclude that the
variability of the drop-size distribution is a
relatively minor factor affecting the precision of
radar estimates of rain rate." We have simulated a
similar experiment corresponding to the data of
Figs. 1a, b and show in Section 5 that the
discrepancies between Figs. 1a, b can be accounted
for by statistical fluctuations only.

As another example, we consider the relationship
between $N_0$ and m (see Eq. (1)) derived by Ulbrich
(1983) using RSDs measured by a Joss-Waldvogel
disdrometer. Ulbrich estimates $N_0$, m, and $D_0$ based
on higher moments of the RSD. This procedure
places more weight on the larger drop sizes as
compared to MLEs which place more emphasis on drop
sizes having a higher frequency of occurrence, Wong
and Chidambaram (1985). Fig. 1c shows a scatter
plot of log10 $N_0$ versus m taken from Ulbrich (1983).
From this data, Ulbrich concludes that physical
processes result in an $N_0$-m relation, and that,
effectively, the three-parameter gamma RSD reduces
to a two-parameter form. We have simulated this
experiment and show in Section 5 that the
relationship between $N_0$ and m is due to the
statistical correlations between estimators of the
higher order moments of the RSDs. The above two
examples imply that statistical fluctuations must
be separated from variations induced by physical
causes. Simulations offer a powerful method of
studying the statistics of radar and surface
measurements where the "natural" fluctuations can
be introduced separately. However, such simulations
involve large scale computations on a
supercomputer since the physical parameters must be
varied over a wide range.

2.  Surface  Disdrometer  Measurements. 

Gertzman and Atlas (1977) have shown that, in
raindrop sampling devices such as disdrometers, the
measurement  variability  is  due  both  to  statistical 
sampling  errors  and  to  real  fine-scale  physical 
variations  which  are  not  readily  separable  from  the 
statistical  ones.  Sasyo  (1965)  and  Cornford 
(1967)  have  shown  that,  for  a  constant  mean  rain 
intensity,  the  total  number  of  raindrops  observed 
will  be  distributed  about  its  mean  according  to  the 
Poisson distribution. This property has been used
by  Joss  and  Waldvogel  (1969),  Gertzman  and  Atlas 
(1977)  and  Wong  and  Chidambaram  (1985)  to  obtain  the 
fractional  standard  deviations  of  higher  order 
moment  estimators  (which  correspond  to  radar 
measurements)  of  RSDs.  In  this  work  we  use  a 
somewhat  different  approach  so  that  the  correlation 
structure  of  higher  order  RSD  moment  estimators  can 
be  computed. 

We  re-write  Eq.  (2)  in  the  form  of  a  gamma  pdf, 

$$f(D) = \frac{\Lambda^{m+1}}{\Gamma(m+1)}\, D^m \exp(-\Lambda D) \qquad (7)$$


In the following development we assume that the
sampling volume, V, is constant and does not vary
with raindrop size. If V does vary with D, then
this dependency can be introduced by multiplying
f(D) by the sampling volume function V(D).

If n raindrops are observed with a fixed sample
volume V with diameters $D_1, D_2, \ldots, D_n$, then this RSD
is a composite distribution of total number of
raindrops (or equivalently, the concentration of
drops within any interval D to D + ΔD) and the drop
diameter, where the diameters are distributed
according to the gamma pdf, and the total number of
raindrops (n) is distributed according to the
Poisson distribution.

Conventional estimators of higher order RSD
moments are expressed as follows:

where LWC stands for liquid water content. The
above estimators can be written in general as

$$\hat{p}_\alpha = \frac{C_\alpha}{V}\sum_{i=1}^{n} D_i^{\alpha}, \quad \text{where} \quad p_\alpha = C_\alpha \int D^{\alpha} N(D)\,dD. \qquad (8d)$$

We can now find the mean of $\hat{p}_\alpha$ as,

where E( ) stands for the expected value. The above
expectation is that of a random sum. Assuming the
$D_i$'s are iid we have,

$$E\Big(\sum_{i=1}^{n} D_i^{\alpha}\Big) = E_n\Big(E_g\Big(\sum_{i=1}^{n} D_i^{\alpha}\Big)\Big) \qquad (9)$$

where $E_n$( ) is the expectation with respect to the
total number of raindrops and $E_g$( ) is the
expectation with respect to the gamma pdf. Hence,

It is easily verified that $\hat{p}_\alpha$ is an unbiased
estimator of $p_\alpha$. Similarly,

We note that the variance decreases with increase in
sample volume, as expected. The fractional
standard deviation (FSD) of $\hat{p}_\alpha$ is

Eq. (12) is identical to Eq. (29) of Gertzman and
Atlas (1977).
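The mean, variance and FSD referred to above follow from standard results for a Poisson random sum. The block below is a sketch of that reasoning under the stated assumptions (n Poisson, the $D_i$ iid with the gamma pdf of Eq. (7), and $N(D) = [E(n)/V]\,f(D)$); it is not a transcription of the paper's numbered equations, whose exact form is not recoverable here.

```latex
\begin{aligned}
E(\hat{p}_\alpha) &= \frac{C_\alpha}{V}\,E(n)\,E(D^{\alpha})
  = C_\alpha \int D^{\alpha} N(D)\,dD = p_\alpha, \\
\operatorname{var}(\hat{p}_\alpha) &= \frac{C_\alpha^{2}}{V^{2}}
  \left[ E(n)\operatorname{var}(D^{\alpha}) + E^{2}(D^{\alpha})\operatorname{var}(n) \right]
  = \frac{C_\alpha^{2}}{V^{2}}\,E(n)\,E(D^{2\alpha})
  \quad \text{since } \operatorname{var}(n) = E(n), \\
\mathrm{FSD}(\hat{p}_\alpha) &= \frac{\sqrt{\operatorname{var}(\hat{p}_\alpha)}}{E(\hat{p}_\alpha)}
  = \frac{\left[E(D^{2\alpha})\right]^{1/2}}{\left[E(n)\right]^{1/2} E(D^{\alpha})},
  \qquad
  E(D^{k}) = \frac{\Gamma(m+k+1)}{\Gamma(m+1)\,\Lambda^{k}} .
\end{aligned}
```

Because E(n) is proportional to V, the variance falls as 1/V and the FSD as $V^{-1/2}$, in line with the remark that the variance decreases with increasing sample volume.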

We now derive the correlation between two
estimators $\hat{p}_\alpha$ and $\hat{p}_\beta$, defined as,

The covariance between $\hat{p}_\alpha$ and $\hat{p}_\beta$ is,

By conditioning on n the covariance can be written
as,

Since the $D_i$'s are iid, cov[$D_i^{\alpha}$, $D_j^{\beta}$] = 0 when i ≠ j.
Therefore, the first term on the right hand side of
Eq. (14) simplifies to E(n) cov($D^{\alpha}$, $D^{\beta}$), while the
second term simplifies to E($D^{\alpha}$)E($D^{\beta}$)var(n). Thus,

Combining with Eq. (11b) we get the correlation
coefficient $\rho_{\alpha,\beta}$ as

It is interesting to note that the correlation
coefficient between $\hat{p}_\alpha$ and $\hat{p}_\beta$ is independent of $N_0$
and $D_0$. In Fig. 2 we show plots of $\rho_{\alpha,\beta}$ versus m,
where $\hat{p}_\alpha$ = $Z_{RSD}$ (radar reflectivity) while $\hat{p}_\beta$
represents estimators with β = 0 (concentration),
β = 1 (mean raindrop size), β = 2 (optical
extinction), β = 3 (liquid water content), and β =
3.67 (rain intensity, R). Note that estimators of
$Z_{RSD}$ and R are highly correlated, and that this
correlation is nearly independent of the shape (m) of the gamma pdf.
Hence, experimental data which show scatter plots
of $Z_{RSD}$ versus R obtained from disdrometers must be
carefully interpreted, i.e., experimentally
derived correlations will contain the effects of
both physical correlations as well as statistical
correlations.
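The statistical correlation discussed above can be evaluated directly. The sketch below assumes Poisson n and iid gamma diameters, for which the covariance of two moment estimators is proportional to $E(D^{\alpha+\beta})$, giving $\rho_{\alpha,\beta} = \Gamma(m+\alpha+\beta+1)/[\Gamma(m+2\alpha+1)\,\Gamma(m+2\beta+1)]^{1/2}$. This closed form is derived here from those assumptions rather than copied from the paper, and the function name is hypothetical.

```python
import math


def moment_correlation(alpha, beta, m):
    """Correlation between the alpha-th and beta-th RSD moment estimators
    for a gamma RSD of shape m, assuming a Poisson drop count."""
    log_rho = (math.lgamma(m + alpha + beta + 1.0)
               - 0.5 * (math.lgamma(m + 2.0 * alpha + 1.0)
                        + math.lgamma(m + 2.0 * beta + 1.0)))
    return math.exp(log_rho)


# Correlation of the reflectivity estimator (alpha = 6) with lower moments,
# tabulated against the shape parameter m.
for m in (0, 1, 2, 5):
    row = [round(moment_correlation(6.0, b, m), 3) for b in (0.0, 1.0, 2.0, 3.0, 3.67)]
    print(m, row)
```

The expression contains neither $N_0$ nor Λ (and hence not $D_0$), which is consistent with the independence noted in the text.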

In order to simulate $\hat{p}_\alpha$ we need to derive its
distribution function. Since $\hat{p}_\alpha$ is a moment
estimator with a random sum, analytic derivations
are hopelessly complicated especially for values of
α > 0. Hence, we resort to simulating $\hat{p}_\alpha$ (for α = 6
and 3.67) by first simulating the RSD. Typical
concentrations of raindrops vary from 10² to 10⁵ per
m³, Gordon and Marwitz (1984). Considering a 100
litre (0.1 m³) sample volume, the number of
raindrops can vary between 10 and 10,000 and the
number increases with sample volume. To simulate
the variability in concentration we have to use
Poisson deviates. Then for each of these numbers
(n) we need to simulate n gamma deviates to
represent the RSD. Next the physical parameters
($N_0$, m and $D_0$) of the gamma RSD must be varied over
the range of observed values. For example,
consider $N_0$ = 8000 mm⁻¹ m⁻³, $D_0$ = 2 mm and m = 0
(exponential RSD) and V = 0.1 m³. To obtain a
scatter plot of $Z_{RSD}$ versus R under these conditions
it takes 123 seconds of execution time on the
CSU/Cyber 180/830 computer. It is apparent that
large computing power is needed to handle such
intensive computational needs.
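The two-stage sampling described above (a Poisson drop count, then that many gamma diameters) can be sketched in scalar form as follows. This is an illustration using Python's standard library, not the authors' vectorized code; the function names are hypothetical, the slope is obtained from the assumed relation $\Lambda D_0 = 3.67 + m$, and the 3.67th moment is returned without the constant that converts it to rain intensity.

```python
import math
import random


def poisson_deviate(mean, rng):
    """Poisson count: number of rate-`mean` exponential arrivals in unit time."""
    count = 0
    t = rng.expovariate(mean)
    while t < 1.0:
        count += 1
        t += rng.expovariate(mean)
    return count


def simulate_moment_estimates(n0, d0_mm, m, volume_m3, n_samples, seed=0):
    """Return a list of (n, Z estimate, 3.67th-moment estimate), one per sample."""
    rng = random.Random(seed)
    lam = (3.67 + m) / d0_mm                       # slope, mm^-1 (assumed relation)
    conc = n0 * math.gamma(m + 1.0) / lam ** (m + 1.0)  # drops per m^3
    mean_count = conc * volume_m3
    results = []
    for _ in range(n_samples):
        n = poisson_deviate(mean_count, rng)
        # gammavariate takes (shape, scale); the gamma pdf has shape m + 1.
        diameters = [rng.gammavariate(m + 1.0, 1.0 / lam) for _ in range(n)]
        z_hat = sum(d ** 6 for d in diameters) / volume_m3
        r_moment = sum(d ** 3.67 for d in diameters) / volume_m3
        results.append((n, z_hat, r_moment))
    return results


samples = simulate_moment_estimates(8000.0, 2.0, 0.0, 0.1, n_samples=200, seed=1)
```

Plotting the third element of each tuple against the second gives a scatter plot of the kind described in the text, up to the scaling constant for R.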


3. Radar Measurements.

Radar measurements of reflectivity Z involve
estimation of Z from the measurement of mean (time
averaged) backscattered power ($\bar{P}$) from a given
radar resolution volume, Doviak and Zrnic (1984).
The sample backscattered power is the incoherent
sum of powers backscattered by each raindrop within
the resolution volume. The fluctuation of the
backscattered power is due to relative motion
between the various drops, and can be related to the
width of the Doppler velocity spectrum. To obtain
the mean power, $\bar{P}$, power samples must be averaged
over a large number of pulses. The total number of
pulses depends on the radar pulse repetition time
(PRT), and the dwell time of the antenna beam on the
resolution volume. Doviak and Zrnic (1984) show
that the power samples have exponential marginal
distributions. Zrnic (1975) has developed a
procedure for simulating the time series of power
samples assuming a Gaussian Doppler velocity
spectrum. We use his method in our work and refer
to Zrnic's paper for details. The principal
assumptions we make are as follows (for a typical
meteorological radar):


Doppler Spectrum         Gaussian, $\sigma_v$ = 6 m s⁻¹
Radar Reflectivity       $N_0\,\Gamma(m+7)\,\{D_0/(3.67+m)\}^{m+7}$
Pulse Repetition Time    1 millisec
Number of Samples        128
Radar Wavelength         10 cm
Receiver                 Power Law (no noise)


Our radar simulations, for a given mean power (or
reflectivity), involve length-N (N = 128) exponential
deviates and length-N complex FFTs. As the
reflectivity is varied (by changing $N_0$, $D_0$, or m) we
see that the radar simulations are also
computationally intensive.
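The ingredients named above (N exponential deviates and a length-N complex transform) can be combined as in the sketch below. It follows the general idea of a spectral simulation with a Gaussian Doppler spectrum, exponentially distributed spectral powers and uniform random phases; the details are assumptions for illustration and should not be read as a reproduction of Zrnic's (1975) procedure. The function name, the default spectrum width and the direct O(N²) inverse transform (used instead of an FFT to stay within the standard library) are all choices made here.

```python
import cmath
import math
import random


def simulate_power_samples(mean_power, sigma_v=2.0, mean_v=0.0, n=128,
                           wavelength_m=0.10, prt_s=1.0e-3, seed=0):
    """Return (power samples, spectral coefficients) for one dwell."""
    rng = random.Random(seed)
    v_nyq = wavelength_m / (4.0 * prt_s)  # unambiguous velocity, m/s
    # Velocity of each DFT bin, with the upper half mapped to negative values.
    vel = [(k if k < n // 2 else k - n) * 2.0 * v_nyq / n for k in range(n)]
    shape = [math.exp(-0.5 * ((v - mean_v) / sigma_v) ** 2) for v in vel]
    total = sum(shape)
    spectrum = [mean_power * s / total for s in shape]  # sums to mean_power

    coeffs = []
    for s in spectrum:
        power = -s * math.log(1.0 - rng.random())  # exponential with mean s
        phase = 2.0 * math.pi * rng.random()       # uniform random phase
        coeffs.append(cmath.rect(math.sqrt(power), phase))

    # Inverse DFT without 1/n scaling, so mean |v_t|^2 equals sum |c_k|^2.
    series = []
    for t in range(n):
        acc = 0j
        for k, c in enumerate(coeffs):
            acc += c * cmath.exp(2j * math.pi * k * t / n)
        series.append(acc)
    return [abs(x) ** 2 for x in series], coeffs


powers, coeffs = simulate_power_samples(mean_power=1.0, seed=4)
sample_mean_power = sum(powers) / len(powers)
```

The sample mean of the 128 power values fluctuates from dwell to dwell around `mean_power`; that fluctuation is the statistical error in radar Z that the simulations are meant to capture.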

4. Simulations.

As discussed in Sections 2 and 3, the surface
disdrometer and radar simulations are computationally
intensive, requiring simulations of gamma,
Poisson and exponential deviates, as well as
complex FFTs. These simulations are repeated many
times as the physical parameters, namely, $N_0$, $D_0$ and
m of the RSD are varied over a considerable range of
values commonly found in rainfall. Thus, these
simulations are ideal for implementation on a
vector computer like the CSU/Cyber 205.

4.1  Exponential  Random  Deviates. 

The  inverse  CDF  technique  for  generating 
exponentials  is  used  here.  We  generate  them  from 
uniform  (0,1)  deviates  and  take  the  negative 
logarithm.  This  method  is  easily  vectorizable 
since  the  CDF  is  in  closed  form  and  conditional 
checks  for  specific  values  can  be  avoided. 
Exponentials  with  means  differing  from  unity  are 
needed  here.  Some  timing  runs  made  on  the 
CSU/Cyber  205  indicate  that  a  length  100  string  of 
deviates  can  be  computed  2.2  times  faster  than  the 
scalar  method,  whereas  the  speed-up  factor 
increases to 4.8 and 6.4 for strings of length 500
and  2500,  respectively.  This  speed-up  factor  is 
important  in  our  work  since  a  large  number  of  these 
simulations  are  needed. 
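The inverse CDF transform described above is a single element-wise operation with no branching, which is why it vectorizes well. A minimal sketch follows; the function name is hypothetical, and log(1 - u) is used in place of log(u) only so that a uniform deviate of exactly 0 cannot cause a domain error (1 - U is also uniform).

```python
import math
import random


def exponential_deviates(uniforms, mean=1.0):
    """Map uniform [0, 1) deviates to exponential deviates with the given mean."""
    return [-mean * math.log(1.0 - u) for u in uniforms]


rng = random.Random(0)
uniform_string = [rng.random() for _ in range(2500)]
deviates = exponential_deviates(uniform_string, mean=3.0)
```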

4.2  Gamma  Random  Deviates. 

Kennedy  and  Gentle  (1980)  discuss  a  number  of 
methods  of  simulating  gamma  random  deviates. 
Among  the  various  algorithms,  the  one  proposed  by 
Cheng  (1977)  appears  to  be  most  suitable  for  vector 
implementation.  This  method  is  an  adaptation  of 
the  envelope  rejection  technique  and  is  described 
in  Appendix  A. 


4.3  Timing. 

Simulation of 50 samples of radar and surface
measurements with $N_0$ = 8000 mm⁻¹ m⁻³, $D_0$ = 2 mm, m = 0
and V = 0.1 m³ took 125 seconds on the CSU/Cyber
180/830 whereas the vector code took 2.9 seconds of
execution time on the Cyber 205. A complete
simulation with $N_0$, $D_0$ and m varying took 160
seconds on the Cyber 205 for a 1 m³ sample volume.

5.  Discussion  and  Results. 

We  now  apply  our  simulation  results  to  two  types 
of  problems,  namely 

a) radar-surface disdrometer intercomparisons, and

b) inferences on gamma RSD parameters using
surface disdrometer measurements.

5.1 Radar/Disdrometer Intercomparisons.

In Section 1 we discussed Zawadzki's (1984)
interpretation of radar/disdrometer intercomparisons
as shown using his Figs. 1a, b. In Figs. 3-6 we
show our simulation results, where Figs. 3a-6a show
scatter plots of surface rain intensity versus
$Z_{RSD}$, while Figs. 3b-6b show scatter plots of rain
intensity versus radar Z. The physical parameters
$N_0$, $D_0$ and m are varied differently in each of Figs. 3-6.
For example, in Figs. 3a, b we represent one rainfall
condition with $N_0$ = 8000 mm⁻¹ m⁻³, $D_0$ = 2 mm and m = 0.

Fig. 3a shows very good correlation between R and
$Z_{RSD}$ whereas in Fig. 3b, R appears uncorrelated with
radar Z. In Fig. 3b note that the distribution of R
is asymmetric. We now vary the physical parameter
m from 0.5 to 5 in Figs. 4a, b with the same $N_0$, $D_0$
values as before. The correlation in Fig. 4a is
significantly higher than in Fig. 4b. In Figs. 5a,
b we keep $N_0$ = 8000 mm⁻¹ m⁻³, m = 0.5 and vary $D_0$ from
0.8 to 2.8 mm. In Figs. 6a, b we vary $N_0$, $D_0$ and m
simultaneously over a broad range of values that can
physically occur. Again, the same feature is
deduced, i.e., R versus $Z_{RSD}$ is more tightly
correlated than R versus radar Z over a wide variety
of physical rainfall conditions. We also note that
the magnitude of the discrepancy is of the same
order shown by Zawadzki's (1984) Figs. 1a, b. Our
results are obtained for an ideal, noise-free radar
with the radar resolution volume completely filled
by a homogeneous rain medium identical to that
sampled by the disdrometer. This implies that even
without considering the various physical factors
enumerated by Zawadzki (1984), we can observe a
similar variability in our simulations as was
observed in the experimental data shown in Figs. 1a,
b. Our variations were obtained solely by the
statistical nature of the measured quantities
without any physical changes or instrumentation
problems. Thus, it is not possible to conclude
that the discrepancy between Figs. 1a, b can be
accounted for by physical causes as enumerated by
Zawadzki (1984). Therefore, it is also not
possible to agree with Zawadzki's conclusion that
the variability of the RSD is a relatively minor
factor affecting the precision of radar estimates
of rain intensity.

5.2 Disdrometer Inferences.

In Section 1 we discussed Ulbrich's (1983)
conclusion that the three-parameter gamma RSD in
fact reduces to a two-parameter RSD with $N_0$
and m being related by $N_0 = 6 \times 10^4 \exp(3.2\,m)$, see
Fig. 1c. Ulbrich (1983) estimates the parameters
$N_0$, $D_0$ and m from disdrometer measured RSDs using
three moment estimators, viz.,

Using experimentally measured RSDs, G, $D_0$ and LWC
are calculated using discrete versions of the
integrals in Eqs. 17a, b, c, from which the gamma RSD
parameters $N_0$, $D_0$ and m are inferred. The result of
this "inversion" procedure is given as a scatter
plot of log10 $N_0$ versus m in Fig. 1c which is taken
from Ulbrich (1983). If a linear relationship
between log10 $N_0$ and m exists for natural
gamma-parameterized RSDs, then in essence the gamma RSD is
reduced to a two-parameter form.

To study the statistical fluctuations in the
moment estimators defined by Eqs. 17a, b, c we use
the Ulbrich inversion procedure with simulated
gamma RSDs. In Fig. 7 we assume gamma RSDs with $N_0$ =
8000 mm⁻¹ m⁻³, m = 0 and V = 1 m³, while $D_0$ = 1 mm.
Observe in this figure that the estimate of log10 $N_0$
varies linearly with the estimate of m, and the slope
is within the range theoretically derived by
Ulbrich (1983). In Fig. 8 $N_0$, $D_0$ and m are varied
over a wide range to encompass the full range of
naturally occurring RSDs. Our scatter points lie
within the two straight lines in Fig. 8 derived
theoretically by Ulbrich (1983) using a large
number of empirical $Z = aR^b$ relations. Comparison
of Fig. 8 with Fig. 1c also shows that our simulated
($N_0$, m) pairs lie within the experimental scatter
derived by Ulbrich (1983). This raises the obvious
question of whether the $N_0$-m relationship derived
by Ulbrich is due to physical causes or whether it is
due to statistical correlations between the various
moment estimators defined in Eq. 17. Note that our
simulations do not preclude the existence of a
physical $N_0$-m relation; the implication of Fig. 8 is
that other methods may need to be used to determine
if a physical $N_0$-m relation indeed exists.
Finally, in Figs. 9 and 10 we show scatter plots of m
versus $D_0$, and $N_0$ versus $D_0$ using the simulations.
Since these scatter plots indicate that the correlation
between the estimates is quite low, it follows that
if significant correlations between these
parameters are observed, then it is a real physical
observation.

Conclusions. 

We  have  considered  a  class  of  statistical 
simulations  which  are  computationally  intensive 
and  amenable  to  implementation  on  a  vector 
computer. We have simulated two totally different
types of measurements, viz., radar and surface
disdrometer measurements of rainfall. These
simulations  involve  exponential,  Poisson  and  gamma 
random  deviates.  The  problem  is  a  large  scale  one 
since  the  parameters  describing  the  rainfall  must 
be  varied  over  a  wide  range.  Thus,  we  have  complete 
control  over  the  physical  and  statistical 
variables. 

We have applied our simulations to explain why
the correlation is less in plots of radar measured
reflectivity versus surface measured rain
intensity as compared to plots when both quantities
are obtained from surface instruments. Previous
interpretations have ascribed this feature to
physical causes. While physical factors are
important when comparing radar measurements of
rainfall to surface measurements of rain intensity,
it is important to have a good measure of
statistical variabilities before ascribing the
features to physical causes alone, Zawadzki (1984).

We have also applied our simulations to show that
$N_0$-m relations, as derived by Ulbrich (1983) using
experimental disdrometer raindrop size distributions
and certain higher order moment estimators,
cannot be ascribed to physical causes. The
simulation results indicate that the moment
estimators are correlated, resulting in a high
degree of correlation between $N_0$ and m, even when
the gamma distribution parameters were widely
varied. Though our simulations do not preclude the
existence of a physical $N_0$-m relation (which is
important for radar measurements of rain
intensity), they suggest that other methods may be
needed to confirm this.

Acknowledgements 

The  authors  are  grateful  to  Dr.  Hari  Iyer  of 
Colorado  State  University  for  many  helpful 
discussions  and  for  a  critical  reading  of  this 
paper.  This  work  was  supported  by  the  U.S.  Army 
Research Office via NCAR Subcontract #83024. One
of the authors (VC) is a Fellow of CSU's Institute of
Computational Studies which provided Cyber 205
computer resources for this study.

APPENDIX  A 

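The standard gamma has the following density
function:

$$ f(x) = \frac{x^{\alpha-1} e^{-x}}{\Gamma(\alpha)}, \quad x > 0. \tag{A1} $$

Using Cheng's notation, let

$$ M = \max_{x} \frac{f(x)}{g(x)} \tag{A2} $$

be finite, where g is a density with distribution
function G. Take a pair of independent U(0,1)
variables, U_1 and U_2 say. Let x = G^{-1}(U_1). Then if
f(x)/[M g(x)] >= U_2 accept x, otherwise reject it.
Each accepted x has density f(x). Cheng suggests
using f(x) as in (A1) and

$$ g(x) = \frac{\lambda \mu x^{\lambda-1}}{(\mu + x^{\lambda})^{2}}, \quad x > 0, \tag{A3} $$

with $\mu = \alpha^{\lambda}$ and $\lambda = (2\alpha-1)^{1/2}$.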

M, the expected number of trials, varies between
1.47 and 1.13 as α varies from 1 to ∞. The
advantage of this method is that it gives a reasonably
small rejection ratio, which translates into a
starting vector length of uniform deviates not much
greater than the length of the string of gamma
deviates required. This method also has a single
decision point, which can be easily accommodated by
using control bit vectors that have bit value "1" for
accepted elements of the vector; these can be gathered
later. The following steps show the equivalent vector
form of Cheng's algorithm. (We denote vectors with an
arrow above the symbols.)

Step 1: Generate a pair of uniform random vectors
$\vec{U}_1$ and $\vec{U}_2$.

Step 2: Set $\vec{V} = a \log\{\vec{U}_1 / (1 - \vec{U}_1)\}$,
$\vec{X} = \alpha e^{\vec{V}}$.

Step 3: If $b + c\vec{V} - \vec{X} \ge \log[\vec{U}_1 \cdot \vec{U}_1 \cdot \vec{U}_2]$,
set bit vector elements to 1.

The constants are:

$$ a = (2\alpha - 1)^{-1/2}, \qquad b = \alpha - \log 4, \qquad \text{and} \qquad c = \alpha + a^{-1}. $$
Note that in the above equations, arithmetic
operations like multiplication, division and
exponentiation imply element by element operation.
The length of these vectors, n', is Mn where n is the
vector length of random deviates required. If the
number of "1" bits falls short of n then those few
random deviates can be computed by the scalar version
of the algorithm. Timing runs made on the Cyber 205
indicate that length 100 deviates can be computed 2.95
times faster, whereas the speed-up increases to 6.1 for
length 50 and 7.1 for length 2500.
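
The three vector steps above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not
the authors' Cyber 205 code: plain Python lists stand in for the
hardware vectors, the function name cheng_gamma_batch and the
oversampling factor 1.5 (chosen to sit just above the worst-case
M = 1.47) are hypothetical, and a shortfall of accepted values is
made up by another batch rather than by a scalar pass.

```python
import math
import random


def cheng_gamma_batch(alpha, n, seed=None):
    """Return n standard gamma(alpha) deviates for alpha >= 1 (batched rejection)."""
    if alpha < 1.0:
        raise ValueError("this sketch assumes alpha >= 1")
    rng = random.Random(seed)
    # Constants as given in the appendix.
    a = 1.0 / math.sqrt(2.0 * alpha - 1.0)
    b = alpha - math.log(4.0)
    c = alpha + 1.0 / a
    accepted = []
    while len(accepted) < n:
        # Assumed oversampling factor, slightly above the worst-case M of 1.47.
        m = int(1.5 * (n - len(accepted))) + 1
        # Step 1: two uniform vectors (an exact 0.0 is nudged so log() is defined).
        u1 = [max(rng.random(), 1e-300) for _ in range(m)]
        u2 = [max(rng.random(), 1e-300) for _ in range(m)]
        # Step 2: element-by-element transforms.
        v = [a * math.log(p / (1.0 - p)) for p in u1]
        x = [alpha * math.exp(t) for t in v]
        # Step 3: control bit vector, True where the candidate is accepted.
        # log(U1 * U1 * U2) is written as a sum of logs to avoid underflow.
        bits = [b + c * t - s >= 2.0 * math.log(p) + math.log(q)
                for t, s, p, q in zip(v, x, u1, u2)]
        # Gather the accepted elements.
        accepted.extend(s for s, keep in zip(x, bits) if keep)
    return accepted[:n]
```

A call such as cheng_gamma_batch(3.0, 1000, seed=1) returns a list
of 1000 deviates; the sample mean and variance should both be close
to alpha.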
Fig. 1a. Scattergram of disdrometer derived
values of Z and R for data in the Toronto, Ont.
region taken during 1977/78/79 summer seasons (from
Richards and Crosier, 1983). Each point
represents a 7 min. sample.

Fig. 1b. Scattergram of radar measured Z and
disdrometer derived R for the same events as in Fig.
1a (from Richards and Crosier).

squares fit to all the data. The solid line is the
least squares N0-b line from empirical Z-R
relationships. Also shown as (Z,R) is the point
deduced from the empirical Z-R relation which applies
to these data (from Ulbrich, 1983).


LLt: _ !•  Correlation  of  reflectivity,  dBZ,  with 

other meteorological quantities of interest
plotted as a function of m. The various curves
represent correlation of dBZ with ( Q)
concentration, (. ) mean particle size, ( a) mean
particle size squared, (+) liquid water content and
(X) rainfall rate.
Fig. 3a. Scatterplot of rainfall rate versus dBZ
where both are derived from the drop spectra
observed by disdrometer. This figure shows 50
realizations under the condition N0 = 8000
(m^-3 mm^-1), D0 = 2.0 mm and m = 0.

Fig. 3b. Same as Fig. 3a except dBZ is "measured"
by radar.

Fig. 4a. Scatterplot of rainfall rate versus dBZ,
both observed on ground. This figure corresponds
to N0 = 8000, D0 = 2.0 mm and m varying between 0.5
and 5.

Fig. 4b. Same as Fig. 4a except dBZ is "measured"
by radar.

Fig. 5a. Scatterplot of rainfall rate versus dBZ
with N0 = 8000, m = 0.5 and D0 varying between 0.8 and
2.8 mm.

Fig. 5b. Same as Fig. 5a except dBZ is "measured"
by radar.

Fig. 6a. Comprehensive scatter plot of rainfall
rate versus dBZ, both of which are derived from
ground observations. N0, m and D0 are varied to
cover a wide range of suggested Z-R
relationships (Ulbrich, 1983).

Fig. 7. Scatterplot of N0 versus m estimates for
average values of N0 = 8000, m = 0 and D0 = 2 mm. The
estimates are tightly correlated, with log N0 and m
being linearly related with slope = 0.95.

Fig. 9. Scatterplot of (m, D0) estimate pairs for
parameters as in Fig. 7. Note the very weak
correlation between the estimates.

Fig. 6b. Same as Fig. 6a with dBZ "measured" by
radar. The Doppler spectrum variance has been
varied linearly between 1 and 6 m/s in proportion to
the values of reflectivity.

Fig. 8. Global scatterplot of N0 versus m
estimates, where N0 is varied between 200 and 2 x 10“
(m mm “), D0 between 0.5 and 2.5 mm, and m between
0 and 5. Note that the scatterplot exhibits a
correlation structure between log N0 and m. The
two dotted lines indicate the boundaries of (N0, m)
relationships derived theoretically by Ulbrich
(1983).

Fig. 10. Scatterplot of (N0, D0) estimate pairs
for parameters as in Fig. 7. Note the very weak
correlation between the estimates.
STATISTICAL DATABASE MANAGEMENT ON MICROCOMPUTERS: THE BENCHMARK PROBLEMS

Robert F. Teitel

TEITEL DATA SYSTEMS
Bethesda, MD 20814

The purpose of this paper is to present the data files and data manipulation
problems posed for the participants of the session on "Benchmarking
Data Management Capabilities of Microcomputer-based Statistical Systems".
The data files described herein were distributed to 10 vendors of
microcomputer-based statistical software, with the intent that they prepare
solutions to the problems. These solutions would then be presented at the
18th Interface Symposium, and would be subjected to comparative
performance benchmarking on a common machine by the author.


The data management problems described
herein have a long history. They were
first used to provide a focus for my
discussant role at the "Workshop on High
Dimensional Files: Large or Complex",
held as part of the 11th Interface
Symposium in 1978. Though the problems
circulated in the statistical computing
community for some time thereafter, they
were not published until the Proceedings of
the 13th Symposium in 1981 [Eddy 1981].
For that Symposium, data tapes were
created and distributed, and six vendors
of statistical software on main-frame
computers presented solutions to the
problems. The problems were published
again in the Proceedings of the First
LBL Workshop on Statistical Database
Management [Wong 1982], together with
the solutions of another set of six
database and statistical system vendors.

The problems were published a third time
as the appendix to a paper on "Statistical
Database Management: A Benchmark
Comparison Among Statistical and Database
Systems" [Teitel 1982b], which
concentrated on the different results
obtained for one of the problems. Since
the results of the required data
manipulation were to be presented as simple
cross-tabulations, it was quite surprising
to discover solutions in which
the cell counts differed.

For the present round, the problem
descriptions and floppy disks containing
the data files were sent to 10 vendors
of micro-computer statistical software.
(All had agreed in principle to participate
in the exercise.) In a departure from
the previous rounds, in which only data
manipulation functionality was stressed,
this round included a performance
component. Each of the vendors would send to
me their completed solutions to the
problems, and enough of their system to
permit me to replicate the solutions on
a common machine (or two). The results
of those performance measures are
presented in a subsequent paper.


The problems were designed to elicit
responses to three fundamental questions
a typical user would have when confronted
with the analysis of non-rectangular
data, as is becoming increasingly more
common. The first question would be "can
a potential system perform the necessary
data management at all?" The second
question would be "what do I, the user,
have to do to get the system to perform
the necessary data manipulation?"
These two questions were the basis of
the earlier functional benchmarking
exercises. The third, and final, question
a typical user would ask is "what
resources does the system consume while
performing the necessary data
manipulation?" Hence our present interest
in comparative performance analysis.


Though appearing at first glance to be
an ad hoc set of data descriptions and
tabulatory requests, the problems are
based on an underlying concept or model
of statistical database management
articulated elsewhere [Teitel 1982a].


The problem set consists of the description
of two data collections, and two
data manipulation exercises for each
data collection. The data manipulation
exercises are stated in terms of the
desired result, that is, simple
cross-tabulations. The format or structure of
the tabulations is not of primary
concern here; it is the data manipulation
necessary to prepare the data for
the tabulation step which is of interest.


TRIPS is a large collection of data
consisting of four groups of variables
(variously called segments, relations,
tables, levels, or record types). The
groups of variables are related to
each other as shown in the following
diagram, and further explained in the
text.
Each household contains a variable
number of cars and a variable number of
persons; each person contains a variable
number of trips. We include zero in the
definition of "variable number". In
addition, each trip, if taken in a car
owned by the household, will contain the
identification of that car. There are no
counts of persons or cars in the household
record, nor is there a count of
trips in the person record. The car
record contains a model-year variable,
the person record an age variable, and
the trip record a duration variable in
addition to the own-car variable already
mentioned. All records contain appropriate
identification variables,
described in the record layout section.

The first of the two cross-tabulations
to be produced from the TRIPS data
collection is a simple count of households
by the number of cars owned and by
the number of persons over the age of 16
in the household. The table should be
something similar to the following.

(The table definition should have made
more explicit the possibility of 0 cars
and/or 0 adults by including a 0 car row
and a 0 adult column. As given, some
vendors excluded those households with 0
cars and/or 0 adults.)

[Table layout: count of households, with rows for # cars and columns for # adults.]


The second cross-tabulation to be
produced from the TRIPS data is a simple
count of trips of at least three days'
duration taken in a car owned by the
household, by the age of the person
taking the trip by the model year of the
car, as shown below.

    trip            model year of car
    count           1970   1971   1972

    age

The two tabulations, though superficially
similar, require different complex
data manipulation capabilities. Each
table consists of a count of occurrences
of one record type based on variable
values in other record types. The first
table requires "downward" access, from
household to cars and persons; the
second table "upward" access, from trips
to persons and cars.
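
A minimal sketch of the "upward" access needed for the second table
may make the distinction concrete. It is written in Python purely for
illustration; the dictionary keys (home_id, person_id, car_id, age,
model_yr, duration) and the function name trip_table are assumptions
of this sketch, not names taken from the benchmark files, and every
trip is assumed to refer to a person and car that actually exist.

```python
from collections import Counter


def trip_table(persons, cars, trips, min_days=3):
    """Count qualifying trips by (age of traveller, model year of car)."""
    # Index the "parent" records so each trip can look upward to them.
    age_of = {(p["home_id"], p["person_id"]): p["age"] for p in persons}
    year_of = {(c["home_id"], c["car_id"]): c["model_yr"] for c in cars}
    table = Counter()
    for t in trips:
        # Skip trips not taken in a household car (car_id 0) and short trips.
        if t["car_id"] == 0 or t["duration"] < min_days:
            continue
        age = age_of[(t["home_id"], t["person_id"])]
        year = year_of[(t["home_id"], t["car_id"])]
        table[(age, year)] += 1
    return table
```

The first table runs the other way: the household is the unit counted,
and the car and person records beneath it are summarized before the
count is made.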


PEOPLE is a large data collection of
various data elements on people. The
data collection is relatively closed
with respect to ancestry: for most
people, data on their parents,
grandparents, etc., as well as on their
children are contained in the data
collection. The variables available for
each person include the year of birth,
the level of education, sex, and the
identification of the mother and father,
if known. The records (of an apparently
rectangular file) are in identification
number order, but not necessarily in any
relationship order.

The first of the two cross-tabulations
to be produced from the PEOPLE data
collection is a simple count of offspring
by the educational level of each
parent, as follows.

    offspring       father's education
    count           elem   hs   coll

    mother's
    education


The second cross-tabulation to be
produced from the PEOPLE data collection is
a count of the "last births" by the age
of the mother at the birth of her last
child by the sex of that last child. The
count of "last births" could be
interpreted as either the count of mothers at
last birth or as a count of the youngest
child of each mother.
These tabulations also require different
data manipulation capabilities, and
different access paths to the necessary
data. The first requires simple "upward"
access to the education variables of the
parents. The second tabulation, if
viewed as a count of mothers, requires
scanning the children to determine the
youngest; if viewed as a count of the
youngest child, it requires scanning the
siblings to determine the youngest.
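
The "scan the children" reading of the second PEOPLE tabulation can
be sketched as follows. This is a hedged illustration in Python, not
any vendor's solution: the field names (id, sex, birth_year,
mother_id) and the function name last_birth_table are hypothetical,
the second classification is assumed to be the child's sex, and when
two children share the latest birth year the first one encountered is
kept.

```python
from collections import Counter


def last_birth_table(people):
    """Count last births by (mother's age at that birth, sex of the child)."""
    by_id = {p["id"]: p for p in people}
    youngest = {}  # mother's id -> her youngest child found so far
    for child in people:
        mother = by_id.get(child["mother_id"])
        if mother is None:
            continue  # mother unknown or not in the collection
        best = youngest.get(mother["id"])
        if best is None or child["birth_year"] > best["birth_year"]:
            youngest[mother["id"]] = child
    table = Counter()
    for mother_id, child in youngest.items():
        age_at_birth = child["birth_year"] - by_id[mother_id]["birth_year"]
        table[(age_at_birth, child["sex"])] += 1
    return table
```

Either interpretation of "last births" gives the same table here,
because exactly one child is retained per mother.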


Programs have been written to generate
the data for the TRIPS and the PEOPLE
data collections. The programs permit
the specification of the basic structure
parameters, how many households, or how
many trips per person, and the ranges of
the values of each of the variables. The
values of the variables necessary for
the tabulations have reasonable values;
the values of other filler variables are
immaterial.

A. The TRIPS Data Collection is
distributed as four separate files, named
HHOLDS.DAT, CARS.DAT, PERSONS.DAT, and
TRIPS.DAT, containing approximately 12k,
25k, 38k, and 115k bytes, respectively,
in MS-DOS 2.00+ format. The record
formats for these files are as follows.


Household record:

    1-4     : household identification
    5       : record type = '1'
    6-9     : ignore
    10-60   : 51 one-digit variables
    61-100  : 20 two-digit variables
    101-120 : 5 four-digit variables

Car record:

    1-120   : as above, except for
    5       : record type = '2'
    6-7     : car within household id

    The last four-digit variable is
    the model year of the car.

Person record:

    1-120   : as above, except for
    5       : record type = '3'
    6-7     : person within household id

    The first two-digit variable is
    the age of the person.

Trip record:

    1-120   : as above, except for
    5       : record type = '4'
    6-7     : person within household id
    8-9     : trip within person id

    The first one-digit variable is
    the id of the car used for the
    trip (with a value of 0 meaning
    other conveyance).

    The last two-digit variable is the
    duration of the trip.

All variables of all record types should
be retained in the system file or
database.
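
For readers who want to read these files without a statistical
package, a fixed-width record can be split as sketched below. The
column positions are an assumption of this sketch: they follow the
FORTRAN-style format 'f4, f1, f2, 2x, 51f1, 20f2, 5f4' used for the
same files in the BMDP solution later in these proceedings, and the
function and field names (parse_record, member_id and so on) are
hypothetical.

```python
def to_int(text):
    """Convert a fixed-width field to int; blank (ignored) fields become None."""
    text = text.strip()
    return int(text) if text else None


def parse_record(line):
    """Split one 120-column TRIPS-collection record into its fields."""
    line = line.rstrip("\r\n").ljust(120)
    return {
        "home_id": to_int(line[0:4]),      # columns 1-4
        "record_type": to_int(line[4:5]),  # column 5: 1 household, 2 car, 3 person, 4 trip
        "member_id": to_int(line[5:7]),    # columns 6-7: car or person within household
        "trip_id": to_int(line[7:9]),      # columns 8-9: used by trip records only
        "one_digit": [to_int(ch) for ch in line[9:60]],                      # 51 fields
        "two_digit": [to_int(line[i:i + 2]) for i in range(60, 100, 2)],     # 20 fields
        "four_digit": [to_int(line[i:i + 4]) for i in range(100, 120, 4)],   # 5 fields
    }
```

Under these assumptions the car used for a trip is one_digit[0], a
person's age is two_digit[0], the trip duration is two_digit[-1] and
the model year of a car is four_digit[-1].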

B. The PEOPLE Data Collection is
distributed as a single file named
PEOPLE.DAT of approximately 73k bytes in
MS-DOS 2.00+ format, and consists of
only a single record type. The format of
the record is as follows.

People record:

    1-5     : person id number
    6-56    : 51 one-digit variables, the
              first of which is the sex
              variable ('1'=female,
              '2'=male).
    57-86   : 15 two-digit variables, the
              first of which is the
              education variable (01-17
              years of schooling).
    87-110  : 6 four-digit variables, the
              first of which is the year
              of birth (1800+).
    111-115 : the id of this person's mother
    116-120 : the id of this person's father

Again, all variables should be retained
in the system file or database.
Making the Push and Shove of Data Management Easier:
Examples of Four File Problems

MaryAnn Hill and Laszlo Engelman


When data are collected they frequently fail to
form the nice tidy rectangle required for analyses
based on classical statistics. Often research
projects suffer considerable time delays and extra
costs by not having easy-to-use tools to ready the
data for analysis. A new system developed by
BMDP, the Data Manager, makes such tasks easier.
We use the DM system here to solve four file
problems. DM is a file manipulating tool designed to
handle most file problems encountered in research
projects, but it is simple enough that users
comfortable with packaged statistical software can
specify complex operations without help from a
programmer. DM includes 20 instruction paragraphs
(READ, SORT, AGGREGATE, etc.) that can be
assembled in a variety of ways, allowing the separation
of a complex task into small manageable pieces
that are assembled step by step in a logical
manner. The resulting collection of paragraphs reads
easily and is self documenting.

The four problems assigned to software
developers at the 1986 Interface meetings illustrate
several of the functions of DM. The first two
problems use four separate files containing
records, respectively, for cars, people, households,
and trips. In Problem 1, information aggregated
from the car and people files is added to the
household file. In Problem 2, the trips file is
expanded by adding information from the car and
people files. Values are replicated when cars
and/or people take multiple trips.

The last two problems concern a file containing
information about several generations of people.
In Problem 3, records for each person's parents
are found (if present) and information on each
parent's education is added to each of their
children's records. In Problem 4, the last child born
to each mother is identified and its birth year
used to compute the mother's age at the birth of
that child.

We first present our strategy and instructions
for the solution of these four file problems and
then highlight additional supports the user may
need to accomplish these tasks for a real study.
It should be pointed out that the initial
instructions are not a minimal set: we included
additional instructions that a user with similar tasks
should consider. For example, in the first
problem about cars, people and households we could
have ignored the household records and simply used
the household id to link the aggregated results
for cars and people. But we wanted to identify
households with neither cars nor people, because
in a real study, they may indicate errors. In the
kid/mom/pop task we included instructions to sort
the file, even though, by eye, we could see that
the generated data were sorted properly for the
merge operations.

Strategy  and  Solutions  for  the  Four  Problems 


Problem 1. We begin with three separate files
with information about cars, people, and
households. Among the 79 variables stored on the car
record there is a household id, a car id, and the
model year of the car (it will be used in Problem
2). The people records also contain a household
id plus the person's age and id. The household
records contain no car or people identifiers.

The goal is to tabulate the number of adults
over 16 years in each household by the number of
cars owned by the household. The results obtained
from the table program BMDP 4F are displayed in
Figure 1.
Figure 1. The number of cars and adults per
household.

These data are weird. There are twelve households
with no people (6+6 in the first column); but the
records indicate the household has one, two, or
more cars. Possibly of greater concern are the
seven (4+3) households that only report kids under
16 years, but report multiple cars.

In Figure 2, we display the DM instructions to
create the input data file for this table. These
instructions aggregate the car and people
information by household and join it to the records in
the household file. That is, we want the number
of cars with each unique household id and the
number of people over 16 years with each unique
household id.

More specifically, the instruction paragraph
labeled #1, in Figure 2, READS the CAR file with
210 records. Each record contains 79 variables
stored in fixed locations. The model year of the
car is variable 79. We use BMDP's convenient
FORTRAN type format reader where the specification
to read a four character field is written F4
instead of F4.0.

Note that the first dozen lines of Figure 2
document the instructions. The notation #i is
used to reference instruction paragraphs below.
On each line of instructions, the DM reader
ignores text following a number (#) sign. For
easier reading, we indented the instructions within
each paragraph. This is not necessary, for DM
instructions are written in free format.

In #2, the N function in the AGGREGATE
paragraph is used to count the number of cars per
household (home_id). The N function is one of
more than two dozen functions available in the
AGGREGATE paragraph for extracting summary
information from values on different records. Other
functions include minimum, maximum, mean, total,
slope, etc. The output work file (c_per_h)
contains one record per household with home_id and
the number of cars.

A similar counting task is carried out in
paragraphs #3 and #4. This time we READ the PERSONS
file containing records for 315 people. AGE is
the 55th variable for each subject. Note that the
# 1. Read the car file.
# 2. Count the number of cars per household.
# 3. Read the persons file.
# 4. Count the number of people over 16 years in
#    each household.
# 5. Read the household file.
# 6. Join the 3 files side-by-side linking records
#    by home_id.
# 7. When car and/or people info is missing, set
#    result to 0.
# 8. Store the file in a system file.
# 9. Delete files that are no longer needed.

READ
    SFILE IS 'CARS.DAT'.
    VNAMES ARE home_id, (3)car_id, (79)model_yr.
    FORMAT IS 'f4, f1, f2, 2x, 51f1, 20f2, 5f4'.        / #1

AGGREGATE
    WITHIN IS home_id.
    '# cars' = N ( home_id ).
    NEWFILE IS c_per_h.                                 / #2

READ
    SFILE IS 'PERSONS.DAT'.
    VARIABLES ARE 79.
    VNAMES ARE home_id, (3)personid, (55)age.
    FORMAT IS 'f4, f1, f2, 2x, 51f1, 20f2, 5f4'.        / #3

AGGREGATE
    WITHIN IS home_id.
    '# adults' = N ( home_id, age > 16 ).
    NEWFILE IS p_per_h.                                 / #4

READ
    SFILE IS 'HHOLDS.DAT'.
    VARIABLES ARE 78.
    VNAME IS home_id.
    FORMAT IS 'f4, f1, 4x, 51f1, 20f2, 5f4'.            / #5

JOIN
    FILES ARE HHOLDS, c_per_h, p_per_h.
    KEY IS home_id.
    KEEP = home_id, '# cars', '# adults'.
    PRINT = 'HC.', 'H..'.
    NEWFILE IS problem1.                                / #6

TRANSFORM
    IF ('# cars' EQ XMIS) THEN '# cars' = 0.
    IF ('# adults' EQ XMIS) THEN '# adults' = -1.       / #7


N function (in #4) incorporates a condition about
AGE: age > 16. That is, within each household
id, only the people over 16 years of age will be
counted.

The READ paragraph labeled #5 reads 105
household records. In the JOIN paragraph (#6), home_id
is used as a key to link the household records
with the appropriate counts of cars and adults.
We could have omitted the household records from
this JOIN FILES list, but we wanted to check for
households with no cars and people. A household
with car(s) and no people would also be strange,
so we insert a request to print the key (home_id)
for each of these unusual occurrences. We specify

    PRINT = 'HC.', 'H..'.

where a period (.) in each three character literal
string indicates a missing record, and a letter (H or
C) indicates that the record from the Household or
Car file is present. So at this point during our
interactive run we identify the 12 households with
one or more cars and no people and three
households with no cars or people. We didn't have to
wait for the table making step (Figure 1).

When there is no car record or people record
available to JOIN with the household record, DM
pads the positions of the values with the missing
value flag XMIS. In the TRANSFORM paragraph (#7)
we change the XMIS flag for '# cars' to zero.
That is, if there is no c_per_h (cars per
household) record then the household had zero cars. If
the count of adults is missing then there are no
people for that household. However, note that if a
household does have people, but no one is over age
16, the N function produces a zero. We decided to
distinguish between no people and no adults, so we
changed the XMIS flag (created when there is no
p_per_h, people per household, record) to -1.

In the SAVE paragraph (#8), we save the counts
of cars and people by household in a BMDP File.
When the table program BMDP 4F reads this file,
the names '# cars' and '# adults' will be stored
with the data. If a household only has people age
16 or less, '# adults' is zero; if there is no one
with that household id, the value is -1.


SAVE
    CODE IS problem1.  NEW.
    SFILE IS 'problem1.sav'.                            / #8

DELETE
    FILES = HHOLDS, problem1, c_per_h, p_per_h.         / #9


Figure 2. DM instructions for PROBLEM 1.
Aggregating car and people information
by household.
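
For comparison, the aggregate-then-join logic of Figure 2 can be
sketched in a general-purpose language. The Python below is not part
of DM; the function name cars_by_adults and the dictionary keys are
hypothetical, and the only conventions borrowed from the text are the
age > 16 condition, the value 0 for households without car records and
the value -1 for households without any person records.

```python
from collections import Counter


def cars_by_adults(household_ids, cars, persons, adult_over=16):
    """Return (table, no_people): household counts by (# cars, # adults)."""
    # AGGREGATE steps: one count per household key.
    cars_per_home = Counter(c["home_id"] for c in cars)
    people_per_home = Counter(p["home_id"] for p in persons)
    adults_per_home = Counter(
        p["home_id"] for p in persons if p["age"] > adult_over
    )
    table = Counter()
    no_people = []  # households to report, like the PRINT request in the JOIN
    for home in household_ids:
        # JOIN + TRANSFORM: missing car count becomes 0.
        n_cars = cars_per_home.get(home, 0)
        if people_per_home.get(home, 0) == 0:
            n_adults = -1  # no person records at all for this household
            no_people.append(home)
        else:
            n_adults = adults_per_home.get(home, 0)  # people, possibly no adults
        table[(n_cars, n_adults)] += 1
    return table, no_people
```

Driving the loop from the household list rather than from the car or
person counts is what lets households with neither cars nor people
appear in the table.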


Problem 2. The structure of the file manipulation
task in this problem is opposite to that in
Problem 1. Instead of aggregating or accumulating
across multiple car and people records per
household, we generate replicates of car and people
records to link with trip records whenever a
car or person makes more than one trip. The goal
for this task is to tabulate information for each
trip: model year of the car used by age of the
person taking the trip. The results from the
table program BMDP 4F are displayed in Figure 3.
Figure 3. Model year of car and age of person taking the trip.


In Figure 4, we show the DM instructions to
create the input file for this table. The numbers
for the steps of the data manipulation tasks
continue from Problem 1, because they were executed
during the same interactive computer session. The
SORT paragraphs (#10 and #11) ensure that the
PERSONS and CAR records are sorted within each
household key (home_id) by person_id and car_id,
respectively. AGE is retained (KEEP) on each
person's record.

Sort  the  PERSONS  file  using  home_id  and 
person  id.  ” 

Sort  the  CARS  file  using  hone_id  and  car_id. 
Read  the  trips  file. 

Sort  the  TRIPS  file  and  join  it  with  the 
PERSONS  file,  using  home_id  and  personid.  The 
period  in  the  HOTKEY  instruction  requests  that 
the  person  record  be  replicated  when  s/he  goes 
on  more  than  one  trip. 

Sort  the  ‘trips+age*  file  and  join  it  with  the 
CARS  file,  using  the  home_id  and  car_id.  The 
HOTKEY  instruction  requests  that  the- car  info 
be  replicated  if  it  goes  on  more  than  one  trip. 
Delete  records  with  less  than  3  ‘duration*  or 
if  the  car  id  code  is  0. 

Store  the  Tile  in  a  system  file. 
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11. 

12. 

13. 


14. 


15. 

16. 


PERSONS. 


CARS. 


person i d . 

personid,  age. 

/  #10 

car_id. 

/  #H 

READ 

SFILE  IS  ‘TRIPS.DAT* . 

VARIABLES  ARE  80. 

VNAMES  ARE  home  id, (3)personid,trip_id,car_id, 
(75y3urat i on .  “ 

FORMAT  IS  ‘f4,  fl,  2f2,  Slfl,  20f2,  5f4*. 

KEEP  a  home  id,  personid,  trip  id,  car  id,  duration. 

/  #12 


SORT
   KEY     = home_id, personid.                   / #13
JOIN
   FILES   = TRIPS, PERSONS.
   KEY     = home_id, personid.
   HOTKEY  = 't.'.
   PRINT   = '.p'.
   DROP    = '.p'.                                / #13
SORT
   KEY     = home_id, car_id.                     / #14
JOIN
   FILES   = TRIPS, CARS.
   KEY     = home_id, car_id.
   HOTKEY  = 't.'.
   DROP    = '.c'.
   KEEP    = home_id, car_id, age, duration, model_yr.
   NEWFILE = problem2.                            / #14


TRANSFORM
   IF (duration lt 3 OR car_id eq 0) THEN USE = -1.
                                                  / #15
SAVE
   CODE    = problem2.
   SFILE   = 'problem2.sav'.
   NEW.                                           / #16

Figure 4. DM instructions for PROBLEM 2.
          Replicating car and people records
          when each makes multiple trips and
          joining with trip information.


In instruction paragraph #12, we READ the 945-
record TRIPS file. Each input record contains 80
variables, and we KEEP the length of the trip
(duration) and ids for the respective home, person,
car, and trip. We SORT this file by person_id
within each household (home_id).

In the JOIN paragraph (#13) we take each
person's AGE and link it to their trip record using
person_id and home_id as merge keys. The period
in the HOTKEY instruction requests that the person
record be replicated when s/he goes on more than
one trip. We use the instruction

   PRINT = '.p'.

to list ids for people who do not go on a trip.
The DROP instruction deletes these records from
the output work file that now contains trip
information plus age. This file is then sorted by
home_id and car_id.

In the JOIN paragraph (#14) the MODEL_YEAR of
the car used (from the CARS file) is joined with
the 'trips + age' records. The period in the
HOTKEY instruction requests that the car
information be replicated if the car goes on more than
one trip.

The instructions for this problem requested
that trips lasting less than three days be deleted
from the report and also that values of car_id
equal to zero not be used. Code 0 for car_id
indicates that an airplane or vehicle other than a
car was used for the trip. The instruction USE =
-1 in the TRANSFORM paragraph (#15) deletes
records with short trip duration and/or invalid
car_ids. The SAVE paragraph (#16) saves the
resulting file for use as input to program BMDP 4F.
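
The sort-then-join logic of paragraphs #13 to #15 can be sketched in Python. This is only an illustration under stated assumptions, not how DM is implemented: the function name merge_join, the tuple layouts, and the miniature data are all hypothetical, and each person is assumed to appear once in the sorted PERSONS list.

```python
def merge_join(trips, persons):
    """Join sorted trip records to sorted person records on (home_id, personid).

    trips:   list of (home_id, personid, car_id, duration), sorted by key
    persons: list of (home_id, personid, age), sorted by key, one per person
    """
    joined = []
    i = 0
    for home_id, personid, age in persons:
        key = (home_id, personid)
        # Skip trips whose key sorts before this person (no person record).
        while i < len(trips) and trips[i][:2] < key:
            i += 1
        # Replicate the person's age onto every trip with the same key.
        while i < len(trips) and trips[i][:2] == key:
            joined.append(trips[i] + (age,))
            i += 1
        # A person with no trips contributes nothing, much like DROP = '.p'.
    return joined


# Hypothetical miniature files, already sorted by (home_id, personid).
persons = [(1, 1, 34), (1, 2, 61), (2, 1, 40)]
trips = [(1, 1, 1, 5), (1, 1, 2, 2), (2, 1, 0, 7), (2, 1, 1, 4)]

trips_age = merge_join(trips, persons)
# Keep trips of at least 3 days made in a real car (car_id != 0).
kept = [r for r in trips_age if r[3] >= 3 and r[2] != 0]
```

Person (1, 2) takes no trips and so never reaches the output, while person (1, 1) has the age value copied onto both of their trips. The same function could be reused for the CARS join after re-sorting by (home_id, car_id).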

Problem 3. In this problem our input data file
contains information about people. Each record
contains an id, the sex of the person, their level
of education (in years), and their year of birth
(birth_yr). In addition, the record contains the
id of the person's mother (mom_id) and their
father (pop_id). Thus, this one rectangular file
contains records for kids, moms, pops,
grandparents, and possibly great-grandparents. For
Problem 3, our goal is to find each person's mother's
record and father's record in the file, thus
enabling us to link the parents' educational level
to each record. We want to tabulate mother's
education versus father's education as shown in
Figure 5. Obviously the data are generated; note
the 123 women who are college graduates who
married dolts, men with only an elementary school
education. This table was obtained in program
BMDP 4F.
Figure 5. Education of mother versus education of
          father.


Our strategy is to make two copies of the
original file. We will take mother records out of
one copy, father records out of the other copy, and
children records from the original file. When we
make the moms file we rename id, education, and
birth_yr to mom_id, mom_educ, and mom_b_yr,
respectively. For the file copy that we will use as
father's data, the id and education are renamed
pop_id and pop_educ. Of course each copy of the
file has the same records as the original file,
but we will use the ids to select only the 'mom'
records from the mom file and only the 'pop'
records from the pop file. In Figure 6, for child 45
we link her mom's education (15 years) and birth
year (1830) and her father's education (18 years)
to the values on her record.


Child      Mom      Pop


In Figure 7, we display the DM instructions to
copy the file and link the respective mom and pop
data with that of each of their children. In
paragraph #1, we READ the PEOPLE file containing
784 records and 75 variables. We will find later
that 744 of these 784 people have parents in the
file. We will use the 'kids' records in this file
and SORT it by mom_id (paragraph #2).

We use the EXTRACT paragraph (#3) to make the
first copy of the file. We will use the 'moms'
records in this file, so we call it 'moms' and
rename the variables from id to mom_id, educatn to
mom_educ, and birth_yr to mom_b_yr.

We next use mom_id as a key to JOIN (#3) the
mom's record side-by-side with that of each of her
children. The literal string in the HOTKEY
instruction includes one position for each file
being joined. The period (.) in 'k.' requests
that the mother's records be replicated when she
has more than one child. When a particular mom_id
is present in the moms file but not in the kids
file, no output will be made because of the
instruction

   DROP = '.m'.

For example, in Figure 6, the code 6 is listed in
the moms file as a mom_id, but 6 is not listed as


1. Read the PEOPLE file.

2. Consider this to be the 'kids' file and sort it
   by mom_id. Sex and birth_yr are needed in
   Problem 4.

3. Copy (extract) the PEOPLE file and call it the
   'moms' file (not all people in the moms file are
   moms, however). Rename variables, changing id
   to mom_id, educatn to mom_educ, etc. Join the
   kids and moms files side-by-side using the mom_id
   to link records. The HOTKEY instruction
   requests that the mother record be replicated when
   she has more than one child. The DROP
   instruction deletes records when a match for the
   mom_id in the moms file is not found in the kids
   file.

4. Sort the new 'kids + moms' file by pop_id.

5. Copy the PEOPLE file, making it be 'pops'.
   Rename id to pop_id, educatn to pop_educ.
   Join the pops file to the 'kids + moms' file. The
   HOTKEY and DROP instructions work as in #3 above
   but with respect to replicating father's records
   and eliminating pops without kids.

6. Store the file in a system file.

Figure 7. DM instructions for PROBLEM 3.
          Linking mother and father records to
          those of each of their children.

a mom_id in the child's file (actually 6 is an id
for a father). A report in the output tells us
that 571 records were dropped because an id in the
mom's file was not matched in the child's file.
Therefore we figure that there were 213 mothers
with children (784 - 571 = 213). The output work
file contains records for children plus their mom's


education and her birth year. We SORT this file
by pop_id (#4).

We use the EXTRACT paragraph (#5) to make a
second copy of the original file. We will use the
pops records from this file, so we change the name
id to pop_id and educatn to pop_educ.

In JOIN (#5) we use pop_id to link the father's
education with his child's record. The HOTKEY and
DROP instructions are used in the same way as
explained previously for the moms file. The output
work file from JOIN contains the desired mom and
pop values appended to their respective children's
records. These records are saved in a BMDP File
(#6) as preparation for input to program BMDP 4F.
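
The copy, rename, and join strategy can be mimicked in a few lines of Python. This is a sketch under assumptions rather than a rendering of the DM internals: the function join_parent, the dictionary layout, and the four-person file are hypothetical, an id of 0 is assumed to mean "parent not in the file", and, as a simplification, children whose parent is not found are left out of the result.

```python
def join_parent(kids, people, role):
    """Attach a parent's education to each kid record.

    role is 'mom' or 'pop'. The renamed copy maps id -> educatn, which plays
    the part of the extracted file with id renamed to mom_id or pop_id.
    Returns the joined records and the number of ids in the copy that no kid
    refers to (the records a DROP = '.m' style instruction would discard).
    """
    key = role + "_id"
    copy = {p["id"]: p["educatn"] for p in people}
    matched = set()
    joined = []
    for kid in kids:
        parent_id = kid[key]
        if parent_id in copy:
            # The parent's value is replicated for every child that names it.
            joined.append({**kid, role + "_educ": copy[parent_id]})
            matched.add(parent_id)
    dropped = len(copy) - len(matched)
    return joined, dropped


# Hypothetical PEOPLE file: 1 and 2 are the parents of 3 and 4.
people = [
    {"id": 1, "educatn": 16, "mom_id": 0, "pop_id": 0},
    {"id": 2, "educatn": 8, "mom_id": 0, "pop_id": 0},
    {"id": 3, "educatn": 12, "mom_id": 1, "pop_id": 2},
    {"id": 4, "educatn": 14, "mom_id": 1, "pop_id": 2},
]

kids_moms, dropped_moms = join_parent(people, people, "mom")
kids_moms_pops, dropped_pops = join_parent(kids_moms, people, "pop")
mothers_with_children = len(people) - dropped_moms
```

The last line repeats the reasoning in the text: the number of records in the file minus the number dropped from the moms copy gives the number of mothers with children.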

Problem  4.  For  this  task  we  are  to  identify  the 
last  child  born  for  each  mother  and  compute  the 
mother’s  age  at  the  birth  of  this  child.  The  goal 
is  to  tabulate  mother’s  age  by  sex  of  this  last 
child.  We  display  these  results  (obtained  from 
program  BMDP  4F)  in  Figure  8. 
Figure 8. Sex of last child by mother's age at
          birth of last child.

The  DM  instructions  to  create  the  input  data 
file  for  this  table  are  displayed  in  Figure  9.  As 
input  for  this  task  we  use  the  data  file  created 
for  PROBLEM  3  that  contains  records  with  the 
child’s  data  plus  mother’s  education,  mother’s 
birth  year,  and  father’s  education.  We  SORT  this 
file  by  mom_id  (#7). 

7. Sort the 'kids + moms + pops' file by mom_id.

8. For each mom, find the sex of her last child
   and her age when the child was born.

9. Store the file in a system file.

SORT
   KEY      = mom_id.                             / #7
AGGREGATE
   WITHIN   = mom_id.
   RUSE     = birth_yr EQ MAX(birth_yr).
   last_sex = FVAL(sex).
   moms_age = FVAL(birth_yr - mom_b_yr).
   KEEP     = mom_id, last_sex, moms_age.         / #8
SAVE
   SFILE    = 'problem4.sav'.  NEW.
   CODE     = problem4.                           / #9
FINISH /

Figure 9. DM instructions for PROBLEM 4.
          Identifying the last child born to
          each mother and computing mother's age
          at last birth.

Because we just sorted the file by mom_id, the
records for the children of each mother form a
set. That is, the records for the children of the
first mother form the first set, followed by the
records for the children of the second mother,
etc. The instructions in the AGGREGATE paragraph
(#8) are executed for each set of records (all
records with the same mom_id). The RUSE (or
RECORD USE) instruction selects the record from
each set that has the MAXIMUM value of birth year,
that is, the latest date or the 'last child born'
to that mother. The FVAL function (FIRST VALUE)
picks the sex code from the record of this last-
born child. The age of the mother at birth of her
last child is computed in the argument of the FVAL
function as the difference between the child's
birth year and the mother's birth year. This
moms_age and sex code for the last child are
output with the mom's id and stored in a BMDP File
for input to program BMDP 4F (#9).
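
The per-mother aggregation can be sketched in Python with itertools.groupby. The field names follow the DM instructions in Figure 9, but the function last_child_per_mom and the sample records are hypothetical, and taking the first record with the maximum birth year is an assumption about how ties would be resolved.

```python
from itertools import groupby
from operator import itemgetter


def last_child_per_mom(records):
    """For each mom_id, report the sex of her last-born child and her age then."""
    out = []
    ordered = sorted(records, key=itemgetter("mom_id"))      # SORT KEY = mom_id
    for mom_id, group in groupby(ordered, key=itemgetter("mom_id")):
        kids = list(group)                                   # one set per mother
        latest = max(k["birth_yr"] for k in kids)            # MAX(birth_yr)
        # First record in the set with the latest birth year (RUSE + FVAL).
        last = next(k for k in kids if k["birth_yr"] == latest)
        out.append({
            "mom_id": mom_id,
            "last_sex": last["sex"],
            "moms_age": last["birth_yr"] - last["mom_b_yr"],
        })
    return out


# Hypothetical 'kids + moms + pops' records.
records = [
    {"mom_id": 8, "sex": "M", "birth_yr": 1940, "mom_b_yr": 1910},
    {"mom_id": 3, "sex": "M", "birth_yr": 1925, "mom_b_yr": 1900},
    {"mom_id": 3, "sex": "F", "birth_yr": 1931, "mom_b_yr": 1900},
]
result = last_child_per_mom(records)
```

Sorting first matters: groupby only groups adjacent records, which is the same reason the DM run sorts by mom_id before the AGGREGATE paragraph.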

Additional Supports for the
Data Manipulation Tasks

In estimating the time to solve these four
problems, total time to do the task is an important
factor. Instead of recording time to execute
already debugged instructions, it would be more
meaningful, if possible, to record the computer
time to assemble and debug the correct set of
instructions. Ideally, the shortest time should
occur in an interactive setting where the user can
access reports and features that identify both
mistakes in the program instructions and errors in
the data (e.g., incorrect keys), correct the
mistakes, and immediately rerun the step in error.

We  now  describe  DM  features  that  we  utilized 
during  the  interactive  development  of  the  correct 
instructions.  As  we  tackled  the  four  problems  we 
asked  many  questions.  For  example. 

Did the File Merge Work? After complex file
merging operations it is helpful to scan a data
listing of the results. In the Problem 3 kid/mom/pop
task, we inserted a PRINT paragraph after JOIN (#5
in Figure 7) and immediately identified two women
married to dolts (see Figure 10). The mothers of
subjects 181 and 186 each have over 18 years of
education, but their husbands report only two or
three years. We initially thought we had made a
mistake but checked the input data and found these
results to be correct. Before this point other
checks are necessary.

Figure 10. After a merge command, a data listing
           is used to check the kid/mom/pop
           output records in PROBLEM 3.
Did We Specify the Record Format Correctly? After
reading  the  CARS  file  in  PROBLEM  1,  DM  returns  a 
report  on  the  record  format  to  the  screen.  We 
checked  this  to  see  if  we  read  the  car  records 
correctly.  See  Figure  11. 


VARIABLE          RECORD   COLUMN      INPUT
NO.  NAME         NO.      BEG   END   FORMAT

  1  home_id      1          1     4   F4.0
  2  V2           1          5     5   F1.0
  3  car_id       1          6     7   F2.0
  4  V4
  .
  .
 78  V78
 79  model_yr

Figure 11. The codebook for car records indicates
           that the model year is stored as the
           79th variable in character positions
           117 to 120.

In  addition,  as  these  records  were  being  read 
during  our  interactive  session,  the  DM  system 
reported  a  record  tally  every  50  records  and  the 
total  records  in  the  file  after  reading  was  com¬ 
pleted.  The  record  tally  is  also  reported  for 
SORT,  MERGE,  JOIN,  and  AGGREGATE  operations. 

What Cases Have Problems? When linking
information in PROBLEM 1 from the Household, Cars, and
People files, we wondered if any households had
cars but no people. The PRINT paragraph (#6,
Figure 2) requests a report of household keys for
output records with the status 'HC.', where HC
means that the Household and Cars records are
present and the period (.) indicates that the
People record is missing. We also requested a
report on keys with the pattern 'H..' (both car
and people records are missing for that
household). In the resulting report we found 12
households reporting a car or cars and no people and
three households with no people or cars. For
example, households with keys 7, 14, 21, 28, etc.
(pattern HC.) report cars and no people; for
pattern H.., the keys are 35, 70, etc.
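
The status patterns can be illustrated with a short Python sketch. It assumes that each file is reduced to the set of household keys it contains; the function status_patterns, the letter P for the People file, and the sample keys are hypothetical, chosen only to mirror the 'HC.' and 'H..' patterns described above.

```python
def status_patterns(households, cars, people):
    """Map each household key to a three-position status string.

    A letter marks a file in which the key is present; a period marks a
    file in which it is missing.
    """
    patterns = {}
    for key in sorted(set(households) | set(cars) | set(people)):
        patterns[key] = (
            ("H" if key in households else ".")
            + ("C" if key in cars else ".")
            + ("P" if key in people else ".")
        )
    return patterns


# Hypothetical key sets for the three files.
households = {1, 7, 35}
cars = {1, 7}
people = {1}

patterns = status_patterns(households, cars, people)
cars_no_people = [k for k, p in patterns.items() if p == "HC."]
empty_households = [k for k, p in patterns.items() if p == "H.."]
```

Listing the keys for one pattern at a time is the same kind of report the PRINT paragraph produces.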

In  PROBLEM  2  (#13)  we  requested  a  report  lis¬ 
ting  both  the  household  key  and  the  person  id  for 
people  who  did  not  take  trips.  From  the  resulting 
report  we  learned  that  person  1  in  household  4 
(the  22nd  case  in  the  file)  did  not  take  a  trip, 
person  4  in  household  5  (the  43rd  case  in  the 
file)  did  not  take  a  trip,  etc. 

Other  Conveniences  During  Interactive  Execution. 
What  happens  if  you  misspell  an  instruction?  The 
Data  Manager  does  not  abort  the  job  and  drop  you 
into  the  system.  Instead  the  instructions  just 
executed  return  to  the  screen  with  line  numbers; 
without  leaving  the  program,  you  can  make  correc¬ 
tions  using  the  BMDP  Line  Editor  and  immediately 
execute  them. 

During an interactive session, you can also
access system commands without exiting the
program. If you forget, say, a file name you type


and  your  system  directory  will  scroll  across  the 
screen.  Any  system  command  may  follow  the  excla¬ 
mation  (!)  —  when  execution  of  the  system  command 
is  completed,  control  returns  to  DM. 

If  you  forget  the  name  of  a  DM  command,  you  can 
request  online  help  by  typing,  for  example 

HELP  READ.  / 

The  program  then  returns  a  brief  definition  of 
READ  paragraph  commands  to  the  screen. 

If  you  request  a  printout  of  your  interactive 
session  your  DM  instructions  are  easy  to  find  and 
they are readable. A row of equal signs (=)
precedes and follows each paragraph of instructions,
clearly  setting  the  user’s  instructions  apart  from 
DM  reports  and  responses.  Scanning  such  a  print¬ 
out  is  useful  for  retracing  your  steps  at  a  later 
time  or  for  someone  else  to  join  in  on  the 
project. 

Thus,  the  Data  Manager  is  a  convenient  and 
comprehensive  tool  for  preparing  data  for  analy¬ 
sis;  and,  in  addition,  the  DM  instructions  are  the 
same  for  many  systems  ranging  from  the  IBM  PC  to 
mainframe  computers. 
PRODAS: PROFESSIONAL DATABASE ANALYSIS SYSTEM


Henry Feldman, Conceptual Software, Inc.


PRODAS  is  an  acronym  for  the 
Professional  Database  Analysis  System. 
PRODAS  combines  powerful  database 
management  with  a  large  array  of 
statistical  routines  and  graphics  into 
an  integrated  system.  It  is  command 
driven  with  syntax  similar  to  SAS 
(SAS  Institute  Inc.).  PRODAS  has  the 
most  sophisticated  database  management 
and  data  entry  capabilities  found  in  a 
software  system  with  statistical 
routines. 

In  1985,  Dr.  Robert  Teitel  asked 
developers  of  the  major  IBM/PC  data 
analysis  packages  to  participate  in  a 
benchmarking  problem  set.  Dr.  Teitel 
wanted  to  evaluate  how  the  different 
packages  performed  in  solving  data 
management  problems.  All  vendors  were 
informed  that  the  packages  would  be  run 
on one computer and timed. We were also
told  that  the  timings  would  be  presented 
at  the  symposium. 

Why  test  software  packages  for  data 
management  capabilities  at  a  statistical 
conference?  Most  statisticians  who 
process  data  know  that  an  important  part 
of  their  work  is  collecting  accurate 
data  and  getting  the  data  ready  for 
analysis.  The  ability  to  quickly  and 
easily  restructure  data,  manipulate 
databases  and  produce  reports  is  very 
important  to  a  statistician. 

How should we define quickly and
easily? Does quickly refer to the
amount of time the computer takes to run
a problem, or should it refer to the
amount of time the user must spend to
solve the problem? Dr. Teitel told the
vendors that he would compare the
computer time for each package.
Therefore, the vendors' job was to write
a program that had the shortest running
time. Since programming time was not
measured, we could spend hours, days, or
weeks modifying the program for the
minimum run time.

At  the  conference  Dr.  Teitel  raised 
the  question,  "Is  running  time  that 
important?"  For  example,  to  load  the 
initial  databases  took  PRODAS  5  minutes, 
and  it  took  SAS  15  minutes.  This  means 
that  PRODAS  runs  3  times  faster  than 
SAS.  Now,  if  it  takes  60  minutes  to 
write  the  program  to  read  the  files, 
then  the  total  time  to  process  the  file 
is  65  minutes  for  PRODAS  and  75  for  SAS. 
As  strange  as  it  may  seem,  the  authors 
of  PRODAS,  which  had  the  best  running 
times  for  every  problem  Dr.  Teitel 
presented,  feel  that  the  important 
timing  is  the  total  amount  of  time  (both 
human  and  computer)  needed  to  solve  the 
problem. 
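
The arithmetic behind this argument is easy to check. The short Python sketch below uses only the figures quoted above (5 and 15 minutes of run time, 60 minutes of programming); the function and variable names are hypothetical.

```python
def total_minutes(programming, running):
    """Total time to solve a problem: human time plus computer time."""
    return programming + running


prodas_total = total_minutes(60, 5)    # 65 minutes
sas_total = total_minutes(60, 15)      # 75 minutes

run_ratio = 15 / 5                        # run time alone: a factor of 3
total_ratio = sas_total / prodas_total    # total time: about 1.15
```

A threefold difference in run time shrinks to roughly a 15 percent difference once the hour of programming is counted.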


This  is  especially  important  as 
computers  become  a  faster  and  cheaper 
resource.  As  new  generations  of 
computers  are  developed,  users  will  be 
able  to  start  up  one  program  and  then 
work  on  another  program.  It  is  better 
to  spend  10  minutes  programming  and  let 
the  computer  run  for  60  minutes  than  to 
program  for  60  minutes  and  let  the 
computer  run  for  10  minutes.  We  feel 
that  as  long  as  the  answer  is  arrived  at 
in  a  timely  fashion,  it  is  better  to 
minimize  programming  time. 

Programming  environments  must  make 
life  easier  for  the  programmer.  We  feel 
that  PRODAS  can  make  life  easier;  but 
since  Dr.  Teitel  tested  running  time,  we 
took  advantage  of  the  structure  of  the 
databases.  I  would  like  to  describe  how 
PRODAS  can  be  used  for  intuitive 
programming,  which  is  fast  and  logical 
to  program  but  runs  slower. 

Dr. Teitel's second problem was to
produce a table of driver's age versus
year of car for trips of at least three
days' duration. In addition to Dr.
Teitel's problems 2 and 3 discussed
here, see the Appendix for the solutions
to Dr. Teitel's problems 1 and 4. Most
of the packages that solved Dr. Teitel's
problem used the fact that the trip
database and the person database were
sorted by house and person. The two
databases are merged together to produce
a temporary database. The temporary
database is re-sorted to match the sorted
order of the car database. The
temporary database is merged with the
car database to find the year of the
car. The driver's age and the year of
the car are then tabled.

The  following  is  the  PRODAS  program 
submitted  to  Dr.  Teitel  to  solve  the 
second  problem: 

program;
   /*
      Merge the trips database with the
      persons database to match the car
      id and age of driver for trips of
      3 or more days.
   */
   create temp;
   merge trips persons;
   by house person;
   if duration >= 3 and car <> '0' then
      output;
   keep house pers_age car;
run;

prosort;
   /*
      Module sorts a database. The
      default database is the last
      created database.
   */
   by house car;
run;

program;
   /*
      Merge the car id and driver age
      with the car database
      to get the year of the car.
   */
   create temp;
   merge temp (in=intemp) cars;
   by house car;
   if intemp then
      output;
   keep pers_age car_year;
   label pers_age = Age of Person;
   label car_year = Model Year of Car;
run;

descrip;
   table pers_age car_year;
   title Number of Trips;
run;



How  would  we  have  written  the  program 
if  it  was  important  to  minimize 
programming  time? 

PRODAS  has  database  features  that  are 
not  found  in  any  other  software  package 
with  significant  data  analysis 
capabilities.  PRODAS  can  randomly 
retrieve,  edit,  update  and  delete 
records  from  a  database.  PRODAS 
supports  an  unlimited  number  of  keys  per 
database  and  an  unlimited  number  of 
variables  per  key.  By  using  keyed 
databases,  programming  is  greatly 
simplified  because  you  do  not  have  to 
become  involved  in  the  database 
structure . 

Using  keyed  databases  the  PRODAS 
program  to  solve  Dr.  Teitel's  second 
problem  is  greatly  simplified.  The 
intuitive  solution  is  to  read  the  trips 
database,  if  the  trip  is  3  days  or 
longer,  get  the  person  who  drove  the 
car,  and  get  the  car  for  that  trip.  We 
will  not  have  to  sort  any  databases  and 
it  will  not  matter  what  the  database 
order  is. 

The  program  module  has  several 
commands  to  randomly  process  databases. 
The  open  statement  names  the  databases 
that  will  be  processed  randomly.  The 
bread  statement  reads  the  database 
randomly.  Bread  stands  for  B-tree  read 
(binary  read) .  The  1  following  the 
database  name  is  the  key  number.  Since 
PRODAS  can  manage  any  number  of  keys,  it 
is  necessary  to  specify  the  key  number. 

The  new  program  is: 

program;
   create temp;
   open persons cars;
   set trips;
   if duration >= 3 and car <> '0';
   /* The above if statement filters
      records based on the expression
   */
   bread persons 1;
   bread cars 1;
   keep pers_age car_year;
   label pers_age = Age of Person;
   label car_year = Model Year of Car;
run;

descrip;
   table pers_age car_year;
   title Number of Trips;
run;

The above program is simpler, faster
to write, and is not dependent on any
particular database ordering. It takes
longer to run because random accessing
is slower than merging two ordered
databases. But the above program took
very little time to write and is a more
logical solution. Because computers are
fast and cheap, in the future we will
want to minimize programming time in
preference to running time, so the second
solution is, in general, a better
solution.
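
The keyed approach can be imitated in Python with dictionaries standing in for keyed databases. This is a sketch under assumptions, not PRODAS itself: the function tabulate_trips and the sample data are hypothetical, the dictionary lookups only play the role of the bread statements, and every trip is assumed to have a matching person and car (a missing key would raise KeyError).

```python
from collections import Counter


def tabulate_trips(trips, persons, cars):
    """Count trips of 3 or more days by (driver age, model year of car).

    trips:   iterable of (house, person, car, duration), in any order
    persons: dict keyed by (house, person) giving pers_age
    cars:    dict keyed by (house, car) giving car_year
    """
    table = Counter()
    for house, person, car, duration in trips:
        if duration >= 3 and car != "0":
            pers_age = persons[(house, person)]   # keyed read of the person
            car_year = cars[(house, car)]         # keyed read of the car
            table[(pers_age, car_year)] += 1
    return table


# Hypothetical data; the trips are deliberately unsorted.
persons = {(1, 1): 34, (2, 1): 40}
cars = {(1, "1"): 1979, (2, "1"): 1984}
trips = [
    (2, 1, "1", 4),
    (1, 1, "1", 5),
    (1, 1, "1", 2),   # shorter than 3 days
    (2, 1, "0", 7),   # car code '0'
    (1, 1, "1", 9),
]

table = tabulate_trips(trips, persons, cars)
```

Because each trip finds its person and car by key, no sorting step is needed and the order of the trips does not change the table.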

PRODAS  is  the  only  Professional  Data 
Analysis  System  with  capabilities  that 
can  solve  Dr.  Teitel's  problem  for 
either  minimal  computer  time  or  minimal 
programmer  time. 

As  a  second  example  of  how  multikeyed 
databases  can  simplify  a  programming 
task,  we  will  compare  the  minimum 
running  time  and  the  minimum  programming 
time  solution  for  Dr.  Teitel's  third 
problem. 

The  third  problem  required  generating 
a  table  of  each  child's  mother's 
education  versus  father's  education. 

We  submitted  to  Dr.  Teitel  two 
versions  of  the  PRODAS  solution  - 
general  and  specific.  The  general 
solution  assumes  the  data  is  too  large 
to  store  in  memory.  The  specific 
solution  loads  the  data  into  memory. 

The  following  is  the  general 
solution: 

program;
   /*
      This version of the education
      level table generation program is
      very general and can work on
      databases of any size.

      The person's education is written
      out as either a mother's or
      father's education for merging
      with descendant.

      The people's database is then
      sorted in mother order to merge
      with the mother database to get
      the mother's education level.

      The people's database is then
      sorted in father order to merge
      with the father database to get
      the father's education level.
   */
   create mother (keep person ed_years)
                 (rename person=mother
                         ed_years=mom_ed);
   create father (keep person ed_years)
                 (rename person=father
                         ed_years=dad_ed);
   set people;
   if sex = '1' then
      output mother;
   else
      output father;
run;

prosort;
   in=people out=temp;
   by mother;
run;

program;
   create temp;
   merge temp (in=inpeop)
         mother (in=inmom);
   by mother;
   if inpeop then
      output;
   label mom_ed = Mother's Education;
run;

prosort;
   by father;
run;

program;
   create temp;
   merge temp (in=inpeop)
         father (in=indad);
   by father;
   if inpeop then
      output;
   label dad_ed = Father's Education;
run;

descrip;
   format educate. mom_ed dad_ed;
   table mom_ed dad_ed / missing;
   title Parent's Education Level;
run;

If this same program was written
using PRODAS's multikeyed databases, it
would be very simple. The intuitive
solution to this problem is to read each
person, find the mother's education
level, and then find the father's
education level. Assuming the people
database was keyed by the person's id,
we can look up the mother and father
randomly.

program;
   /*
      This version of the education
      level table generation program
      uses the multikeyed databases.

      If the person's parents are
      on the database, the program
      randomly reads the mother's record
      and saves the mother's education
      level. The program then randomly
      reads the father's record and saves
      the father's education level.
   */
   create temp (keep mom_ed dad_ed);
   open people;
   set people;
   if mother > 0 and father > 0;
   father1 = father;
   person = mother;
   bread people 1;
   mom_ed = ed_years;
   person = father1;
   /* The father1 variable is used
      because the current father
      variable has the mother's
      father. */
   bread people 1;
   dad_ed = ed_years;
   label mom_ed = Mother's Education;
   label dad_ed = Father's Education;
run;

descrip;
   format educate. mom_ed dad_ed;
   table mom_ed dad_ed / missing;
   title Parent's Education Level;
run;

As you can see, the multikeyed
solution is much simpler to program and
is much more intuitive than the "sort
and merge" approach of the other
solution. Since PRODAS supports both
sequential merging and multikeyed random
database accessing, you can decide if it
is important to minimize running time or
programming time.

Appendix

Solution for Dr. Teitel's second
problem.

/*
   This program produces a table of
   trips of at least three days'
   duration. The row and column axes
   are Age of driver and Year of car.
*/

program;
   /*
      Merge the trips database with the
      persons database to match
      the car id and age of driver for
      trips of 3 or more days.
   */
   create temp;
   merge trips persons;
   by house person;
   if duration >= 3 and car <> '0' then
      output;
   keep house pers_age car;
run;

prosort;
   by house car;
run;

program;
   /*
      Merge the car id and driver age
      with the car database to get the
      year of the car.
   */
   create temp;
   merge temp (in=intemp) cars;
   by house car;
   if intemp then
      output;
   keep pers_age car_year;
   label pers_age = Age of Person;
   label car_year = Model Year of Car;
run;

descrip;
   table pers_age car_year;
   title Number of Trips;
run;

Solution  to  Dr.  Teitel's  fourth  problem: 

program;
   /*
      This version of the last births
      table generation program
      solves the general case for a
      database of any size.

      For each person, a record is
      written with the mother and
      child's birth date. The new
      database is sorted by the mother's
      id and child's birth year. The
      youngest child is then the last
      record for the mother.
   */
   create temp (keep person c_birth
                     childsex);
   set people;
   if mother > 0 then do;
      person = mother;
      c_birth = birth;
      childsex = sex;
      output;
   end;
   label childsex = Sex of Last Child;
run;

prosort;
   by person c_birth;
run;

program;
   create temp;
   merge people temp (in=intemp);
   by person;
   if intemp and last.person;
   mom_age = c_birth - birth;
   label mom_age =
         Mother's Age At Latest Birth;
run;

descrip;
   format $sex. childsex;
   format momage. mom_age;
   table mom_age childsex;
   title Age of Mothers At Last Birth;
run;

Since  Dr.  Teitel  was  timing  programs 
for  minimum  running  time,  we  submitted 
two  specific  solutions  for  problems 
three  and  four  that  assumed  numeric  "id" 
values . 

Specific  solution  for  problem  3: 

program;
   /*
      This version of the education
      level table generation program
      assumes that each person's mother
      and father have preceded
      them in the input file. Also,
      this program assumes that the
      number of people is sufficiently
      small so the education levels can
      be stored in memory.
   */
   conarray educate [1000];
   create temp;
   set people;
   educate[person] = ed_years;
   mom_ed = educate[mother];
   dad_ed = educate[father];
   label mom_ed = Mother's Education;
   label dad_ed = Father's Education;
run;

descrip;
   format educate. mom_ed dad_ed;
   table mom_ed dad_ed / missing;
   title Parent's Education Level;
run;
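
The in-memory idea behind this specific solution can be sketched in Python. It relies on the same two assumptions as the comment above: ids are small integers, and each parent appears in the file before any of their children. The function parent_education and the sample data are hypothetical, and an id of 0 is assumed to mean that the parent is not in the file.

```python
def parent_education(people, max_id=1000):
    """Single pass over the people file using an array indexed by person id.

    people: iterable of (person, mother, father, ed_years), parents first
    Returns (person, mom_ed, dad_ed) rows; None marks an unknown parent.
    """
    educate = [None] * (max_id + 1)   # slot 0 stays None: "no parent on file"
    rows = []
    for person, mother, father, ed_years in people:
        educate[person] = ed_years    # remember this person's education
        rows.append((person, educate[mother], educate[father]))
    return rows


# Hypothetical file: 1 and 2 are the parents of 3; 3 and 2 are the parents of 4.
people = [(1, 0, 0, 12), (2, 0, 0, 16), (3, 1, 2, 14), (4, 3, 2, 10)]
rows = parent_education(people)
```

No sorting or merging is needed, but the result is only correct when the ordering assumption holds; a child read before a parent would see None for that parent.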

Specific  solution  for  problem  4: 

program ; 

/* 

This  version  of  the  last  births 
table  generation  program. 

This  program  assumes  that  the 
number  of  people  is  sufficiently 
small  so  the  education  levels  can 
be  stored  in  memory. 

*/

conarray m_birth [1000];
conarray m_oldest [1000];
conarray m_sex $1 [1000];

create temp (keep mom_age childsex);
set people (end=endpeop);

if sex = '1' then
m_birth[person] = birth;
if m_oldest[mother] < birth then do;
m_oldest[mother] = birth;
m_sex[mother] = sex;
end;

if endpeop then
for i = 1 to 1000 do
if m_birth[i] and m_oldest[i]
then do;
mom_age = m_oldest[i] -
m_birth[i];
childsex = m_sex[i];
output;
end;

label mom_age =
Mother's Age At Latest Birth;
label childsex = Sex of Last Child;
run;

descrip; 

format  $sex.  childsex; 

format  momage.  mom_age; 

table  mom_age  childsex; 

title  Age  of  Mothers  At  Last  Birth; 

run; 


SOLVING  COMPLEX  DATA  MANAGEMENT  PROBLEMS  IN  P-STAT® 


Shirrell Buhler, P-STAT, Inc.


ABSTRACT 

At the 13th Interface, six different packages submitted their
solutions to four problems designed by Robert F. Teitel, two each
for two different data sets. At the 18th Interface, the same four
problems were presented, this time to be run on a micro computer.
Because P-STAT is functionally identical in all environments, the
mainframe solutions presented at the 13th Interface could have been
run in P-STAT on the PC with only those changes required by
differences in the structure of the raw data files. However, P-STAT
has had many enhancements in five years and the current solutions
are even easier to program and to comprehend than the 1981
solutions. The solutions presented here are identical for all the
machines on which P-STAT is supported. These solutions provide ample
evidence that complex data management problems can be solved on a
micro computer such as a PC.


1. PROBLEMS FOR THE FIRST DATABASE

The first two problems involve four files which taken together
describe a system with a hierarchical structure.

H  A household record with
C  0 to 9 car records, and
P  0 or more person records;
   a person record may own many
T  trip records.

The household Id number is contained in all records. A car number
provides a link between a car record and a trip record. A person
number provides a link between a person record and a trip record.
For the purposes of this exercise the only variables of interest
aside from the linking variables and the record type are:

Cars file      model year of the car
Persons file   age of person
Trips file     duration of the trip

The record types are: 2=cars, 3=persons and 4=trips. The household
file is needed only if a count of the empty households (no people or
cars) is desired.

The first problem is to produce a table with counts of the number of
people over the age of 16 by the number of cars in a household. The
second problem is to produce a table of age of person by model year
of car for trips of at least three days duration.

If empty households are ignored, the first problem only requires
variables from the Cars file and the Persons file. One approach, if
this were the only problem, would be: 1) aggregate the Cars file
creating a file of summary records for each household; 2) aggregate
the Persons file (selecting people over 16) creating a second file
of summary records for each household; 3) join the two household
summary files; and 4) tabulate the results of the join. These files
are all P-STAT system files.

However, there are two problems to solve. A single sort can be used
to arrange the Cars, Persons, and the Trips data (needed for the
second problem), into a working file appropriate for both problems.
Both tabulations can then be done in a single step although they are
shown here as separate steps for the sake of clarity.

2. BUILDING THE FIRST DATABASE

The commands to create four P-STAT system files from the four raw
data files are very similar. The command to create the Cars file is:

BUILD Cars, FIXED, FILE Cars.Dat,
LENGTH 120;
VARS
Household.Id 1-4 Record.Type 5
Car.Number 6-7 C1 TO C51 10-60
CC1 TO CC20 61-100 C41 TO C44 101-116
Model.Year 117-120 (ALLOW 1900 TO 1986)$

The next step combines the records from the Cars, Persons and Trips
files into a single file in the desired order. The Household file
was omitted from this step so that houses with neither cars nor
people would not be included in the computations.

P-STAT permits multiple files to be dynamically concatenated as they
are input to any command. New variables can be created, existing
variables can be recoded, and adjacent cases can be combined as the
records from each file are processed. Here, the SORT command is used
with three input files.

When files are dynamically concatenated using the plus (+) operator,
all cases must ultimately have the same variables in the same order.
As each case is processed, variables that are not found in the
current file are created and set to missing. Finally a variable
rearrangement (KEEP) is done so that all the cases in all the files
have the same variables in the same order.

The resulting output file has all the records for a household
together. The car records are first because Person.Id was set to 0
in the SORT step. Each Person record is followed by zero or more
trip records. This file can now be used for either of the
crosstabulations.
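
The effect of the concatenate-and-sort step can be sketched in Python. This is an illustration only, not P-STAT code: the field names, the use of None for a missing value, and the function concat_and_sort are assumptions made for the example.

```python
FIELDS = ("household_id", "record_type", "car_number", "model_year",
          "duration", "person_id", "age")


def concat_and_sort(cars, persons, trips):
    """Give every record the same fields, then sort into working order.

    Fields absent from a source record are created and set to None
    (standing in for a missing value). Car records get person_id 0 so
    that they sort ahead of the person and trip records of a household.
    """
    combined = []
    for rec in cars:
        row = {f: rec.get(f) for f in FIELDS}
        row["person_id"] = 0
        combined.append(row)
    for rec in list(persons) + list(trips):
        combined.append({f: rec.get(f) for f in FIELDS})
    # Assumes household_id, person_id and record_type are never missing.
    combined.sort(key=lambda r: (r["household_id"],
                                 r["person_id"],
                                 r["record_type"]))
    return combined


working = concat_and_sort(
    cars=[{"household_id": 1, "record_type": 2,
           "car_number": 1, "model_year": 1975}],
    persons=[{"household_id": 1, "record_type": 3,
              "person_id": 1, "age": 30}],
    trips=[{"household_id": 1, "record_type": 4, "person_id": 1,
            "car_number": 1, "duration": 5}],
)
```

Within a household the sketch yields the car records first, then each person record followed by that person's trip records, which is the order both tabulations rely on.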


SORT Cars
  ( GENERATE Duration  = .M1.,
    GENERATE Person.Id = 0,
    GENERATE Age       = .M1. )
  ( KEEP Household.Id Record.Type
         Car.Number   Model.Year
         Duration     Person.Id  Age )

+ Persons
  ( GENERATE Car.Number = .M1.,
    GENERATE Model.Year = .M1.,
    GENERATE Duration   = .M1. )
  ( KEEP Household.Id Record.Type
         Car.Number   Model.Year
         Duration     Person.Id  Age )

+ Trips
  ( GENERATE Age        = .M1.,
    GENERATE Model.Year = .M1. )
  ( KEEP Household.Id Record.Type
         Car.Number   Model.Year
         Duration     Person.Id  Age ),

BY Household.Id Person.Id Record.Type,
OUT Trips2 $


Because comparative timings were to be done, the single SORT using
dynamic concatenation was used rather than the following two step
procedure.

CONCAT Cars
  ( KEEP Household.Id Record.Type
         Car.Number   Model.Year )
  ( GENERATE Person.Id = 0 )

Persons
  ( KEEP Household.Id Record.Type
         Person.Id    Age )

Trips
  ( KEEP Household.Id Record.Type
         Person.Id    Duration ),

OUT Trips2 $

SORT Trips2,
BY Household.Id Person.Id
   Record.Type,
OUT Trips2, REPLACE $

The use of CONCAT followed by SORT requires two passes through the
data file. The two step procedure is easier to program and the
possibility of an alignment error in the common variables is
eliminated because CONCAT does the alignment automatically. When
timings and disk space considerations are not important, the two
step procedure is clearly preferable.

2.1  Problem  1 

Obtaining the frequencies for the number of cars in a household by
the number of persons over 16 is done in a single step using the
TABLES command and on-the-fly aggregation in P-STAT's programming
language.

TABLES Trips2

( IF FIRST ( Household.Id ),
GENERATE #Cars = 0,
GENERATE #Persons.Over.16 = 0 )

( IF Record.Type = 2, INCREASE #Cars )

( IF Record.Type = 3 AND Age > 16,
INCREASE #Persons.Over.16 )

( IF LAST ( Household.Id ) CONTINUE )

( KEEP #Cars #Persons.Over.16 ) ;

TABLE 'Household Count'
Cars BY Persons.Over.16 $

In this example, the number of cars and number of persons over 16
are computed as the file is given to the TABLES command. FIRST and
LAST permit a test for the start and end of a given household.
Scratch variables (#Cars and #Persons.Over.16) are set to zero as
the first record for each household is processed. #Cars is increased
each time a car record, Record.Type = 2, is read. #Persons.Over.16
is increased when Record.Type equals 3 and Age is greater than 16.

Only a single record for each household is actually sent to the
TABLES command. That record contains just the two variables, Cars
and Persons.Over.16, that are needed for the tabulation.
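
The FIRST/LAST aggregation described above can be sketched in Python. The sketch is an illustration, not P-STAT: the record layout (dicts with household_id, record_type and age) and the function household_counts are assumptions, and the input is taken to be already sorted by household.

```python
from collections import Counter
from itertools import groupby


def household_counts(records):
    """Cross-tabulate households by (cars, persons over 16).

    `records` must already be grouped by household_id, as the sorted
    working file is. Each group is reduced to one summary pair, which
    mirrors sending a single record per household to the tabulation.
    """
    table = Counter()
    for _, group in groupby(records, key=lambda r: r["household_id"]):
        cars = 0                 # reset at the first record of a household
        persons_over_16 = 0
        for r in group:
            if r["record_type"] == 2:
                cars += 1
            elif (r["record_type"] == 3 and r["age"] is not None
                  and r["age"] > 16):
                persons_over_16 += 1
        table[(cars, persons_over_16)] += 1   # emitted at the last record
    return table


demo = household_counts([
    {"household_id": 1, "record_type": 2, "age": None},
    {"household_id": 1, "record_type": 2, "age": None},
    {"household_id": 1, "record_type": 3, "age": 40},
    {"household_id": 1, "record_type": 3, "age": 10},
    {"household_id": 2, "record_type": 3, "age": 17},
])
```

Records with a missing age are skipped by the explicit None test, which corresponds to the automatic exclusion of missing values in a comparison.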

Household Count

                 Persons.Over.16
                                        Row
Cars           0      1-2      3+    Totals

0              2        5      11        18
1-2           10       15      17        42
3+             9       15      18        42

Total N       21       35      46       102

NOTE: There were two households with person records that had neither
cars nor adults. This table was interactively post-processed and
relabelled within the TABLES command so that it would fit within the
constraints of this two column layout.

2.2 Problem 2

This second problem requires more from a package than simple
aggregation. Both Cars and Persons are at the same level of the
hierarchy. Trips are associated with Persons and only indirectly
with Cars.

After the SORT, all the car records for a household precede any
person records and a person record precedes each set of trip
records. It is necessary to store both the model years for up to
nine household cars and the age from the person record until a trip
record for that person is read. Again, this can be done in the
P-STAT programming language as the file is given to the TABLES
command.

As each car record is read the model year is stored in the permanent
(P) vector. The model year for car number 1 is stored in P(1), the
model year for car number 2 is stored in P(2), etc.

As each person record is read, variable Age is stored in the scratch
variable #Persons.Age. The P vector and scratch variables are used
for aggregation and for moving values from one record to a
subsequent record.
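
The idea of carrying values forward from car and person records to later trip records can be sketched in Python. This is an illustration under assumptions, not P-STAT: the record layout and the function long_trip_table are hypothetical, and the sketch clears its stored values at each new household, which the permanent vector described above is not said to do.

```python
from collections import Counter


def long_trip_table(records):
    """Count qualifying trips by (person age, model year of car).

    `records` is the sorted working file: within a household the car
    records come first, then each person record followed by that
    person's trip records.
    """
    table = Counter()
    current_household = None
    model_years = {}      # car number -> model year (the "P vector")
    person_age = None     # age carried from the latest person record
    for r in records:
        if r["household_id"] != current_household:
            current_household = r["household_id"]
            model_years = {}
            person_age = None
        if r["record_type"] == 2:
            model_years[r["car_number"]] = r["model_year"]
        elif r["record_type"] == 3:
            person_age = r["age"]
        elif r["record_type"] == 4:
            car = r.get("car_number") or 0
            if (r["duration"] is not None and r["duration"] >= 3
                    and person_age is not None and person_age > 16
                    and car in model_years):
                table[(person_age, model_years[car])] += 1
    return table


trips_demo = long_trip_table([
    {"household_id": 1, "record_type": 2, "car_number": 1, "model_year": 1975},
    {"household_id": 1, "record_type": 3, "age": 30},
    {"household_id": 1, "record_type": 4, "car_number": 1, "duration": 5},
    {"household_id": 1, "record_type": 4, "car_number": 0, "duration": 7},
    {"household_id": 1, "record_type": 4, "car_number": 1, "duration": 2},
    {"household_id": 1, "record_type": 3, "age": 12},
    {"household_id": 1, "record_type": 4, "car_number": 1, "duration": 4},
])
```

Only the first trip qualifies in the example: the second was not made in a household car, the third is too short, and the last belongs to a person under 17.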

TABLES Trips2

( IF Record.Type = 2,
SET P(Car.Number) = Model.Year )

( IF Record.Type = 3,
GENERATE #Persons.Age = Age )

( IF Record.Type = 4 AND
Duration >= 3 AND
#Persons.Age AMONG ( 17 to 99 ) AND
Car.Number > 0, CONTINUE )

( GENERATE Year.of.Car = P(Car.Number) )

( KEEP #Persons.Age, Year.of.Car );

TABLE 'Trips in Household Cars'
'By Family Members Over 16'
'Lasting 3 or More Days'
Persons.Age BY Year.of.Car,
EDGES TL $

Each trip record is examined to see if: 1) duration is at least
three days; 2) the current person age is over 16; and 3) the trip
was in fact made by car. This is done in the programming language by
using an IF with a series of AND's. The records for which this IF
test is true are the only records that the TABLES command receives.
Any records that have missing values on any of the variables tested
in the IF statement are automatically excluded.

Model year is moved into the selected trip record from the P vector.
If the trip is in car number 3, P(Car.Number) references P(3) which
is the location where the model year for the third car in the
household is stored. The trip record that is given to TABLES
contains two variables, Persons.Age, derived from the scratch
variable #Persons.Age, and Year.of.Car.

Trips in Household Cars
By Family Members Over 16
Lasting 3 or More Days

                   Year.of.Car

Persons     1971-    1975-    1979-       Row
Age          1974     1978     1982    Totals

17-23          39       46       47       132
24-30          68       46       42       156
31+            23       28       40        91

Total N       130      120      129       379

NOTE: This table was also interactively post-processed and
relabelled within the P-STAT TABLES command to format the table so
that it would fit within the 2-column layout.

3. PROBLEMS FOR THE SECOND DATABASE

The second data set contains records of a genealogical nature.
Variables in a person's record provide Id numbers for that person's
father and mother. Other variables contain information such as sex,
education and year of birth. Both of the problems for this data set
require information from a given person record to be linked with
information from the records of his parents. A zero in the Mother.Id
or Father.Id fields indicates a mother or father record that is not
present in the file.

ID   SEX   EDUCATION   YEAR OF   MOTHERS   FATHERS
                       BIRTH     ID        ID

1     1       16        1929        6         7
2     2       12        1935        6         7
3     1       12        1932        8         0
4     2        8        1905       15        17
5     2       10        1910       15        17
6     2       12        1900       33        44
7     1       12        1898        0         0
8     2        8        1910        0        56

Here 6 is the mother of both 1 and 2 and 7 is their father, 8 is the
mother of 3 who has no father record in the file.

The third problem is to produce a table of mother's education by
father's education. This requires that information from three
records in the file be simultaneously available. The fourth problem,
which is to produce a table of mother's age by sex of child at last
birth, requires that the mother information be combined with
information from each of her children's records and that information
for all of her children be scanned to locate the youngest child.

4. BUILDING THE SECOND DATABASE

Even though the second data set is a single file, it is tricky
because information from several non-adjacent records needs to be
simultaneously available. Three steps are required before the
tabulations can be done.

BUILD People, FILE People.dat,
FIXED, LENGTH 120 ;
VARS
Person.Id 1-5 Sex 6
Education 57-58 Birth.Year 87-90
Mother.Id 111-115 Father.Id 116-120 $

SEPARATE People ( KEEP Person.Id
Birth.Year Education Sex ),
OUT1 Mothers, OUT2 Fathers, EXTRA $

LOOKUP People
( KEEP Person.Id Mother.Id
Father.Id Birth.Year Sex )
( IF Mother.Id < 1 AND
Father.Id < 1, EXCLUDE ),
TABLE Mothers
( RENAME Person.Id TO Mother.Id,
RENAME Birth.Year TO M.Birth.Year,
RENAME Education TO M.Education )
Fathers ( DROP Birth.Year )
( RENAME Person.Id TO Father.Id,
RENAME Education TO F.Education ),
OUT People.Parents $

The first step creates a P-STAT system file from the raw data. The
second step creates a file of possible mother records and a file of
possible father records. This is done in the SEPARATE command which,
when provided with the EXTRA identifier, uses the rightmost variable
(Sex) to determine the correct output file for each case. That extra
variable is not included in the output file. The third step uses the
LOOKUP command to join mother and father information to each record
in the People file.

In 1981 the LOOKUP command did not exist and the solution to these
problems required several steps. Parent files were created; the
child file was sorted by Father.Id and joined to the file of father
data; that file was then sorted by Mother.Id and joined to the file
of mother data. In 1986, four steps have been replaced by one. The
result is fewer passes through the data file, and the solution is
far easier to program and to program correctly the first time.

4.1 Problem 3

With the parent information available on all the records, it is only
necessary to recode education into education groups and tabulate
those groups. The recode is done in the P-STAT programming language
as the file is passed to the TABLES command.

TABLES People.Parents

( SET M.Education = RECODE
( M.Education, 0 TO 12=1, 13 TO 16=2,
17 TO 24=3, X=4 ),

SET F.Education = RECODE
( F.Education, 0 TO 12=1, 13 TO 16=2,
17 TO 24=3, X=4 ));

LABELS M.Education (1) No College
(2) College (4) Graduate Work /
F.Education /,

TABLE
'Mothers Education by Fathers Education'
'Offspring Count'
M.Education BY F.Education $

Because both variables require the same modification, a FOR loop can
be used instead of two separate modifications. Since the
modification is sequential, the NCOT function, which supplies
cutting points, can replace the RECODE.

( FOR ( J: M.Education F.Education ),
SET V(J) = NCOT ( V(J), 12, 16 ));

This is very easy to program but is not as self-explanatory as the
individual RECODEs.
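
Recoding by cutting points can be sketched in Python with a binary search. The sketch assumes each cutting point is the inclusive upper bound of its group, which is how the equivalent RECODE (0 TO 12=1, 13 TO 16=2) reads; the function name cut_group is hypothetical and this is not the P-STAT function itself.

```python
from bisect import bisect_left


def cut_group(value, cut_points):
    """Return the 1-based group number of `value`.

    `cut_points` must be sorted ascending. A value equal to a cutting
    point falls in the lower group. Missing values (None) stay missing.
    """
    if value is None:
        return None
    return bisect_left(cut_points, value) + 1


education_groups = [cut_group(v, [12, 16]) for v in (8, 12, 13, 16, 17, None)]
```

One function applied to both education variables plays the same role as the FOR loop: the grouping rule is written once.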

Mothers Education by Fathers Education
Offspring Count

                       F.Education

                     No              Graduate       Row
M.Education       College   College      Work    Totals

No College            351       111                  462
College                          42       117        159
Graduate Work         123                            123

Total N               474       153       117        744

4.2  Problem  4 

For  the  final  problem  it  is  necessary 
to  look  at  all  the  children  for  a  given 
mother  and  find  the  case  with  the  most 
recent  birth  date. 

TABLES People.Parents

( COLLECT 20, BY Mother.Id,
SORT Birth.Year (D),
CARRY M.Birth.Year )

( GENERATE Age.Last.Birth = NCOT
( Birth.Year.1 - M.Birth.Year,
18, 25, 30, 35 ))

( GENERATE Sex.Last.Born = Sex.1 );

LABELS
Sex.Last.Born (1) male (2) female /
Age.Last.Birth (1) 18 and under (2) 19-25
(3) 26-30 (4) 31-35
(5) 40 and over /,

T " Mother's Age at Last Childbirth "
" By Sex of Child "
Age.Last.Birth BY Sex.Last.Born $

The ability to COLLECT a number of adjacent cases and combine them
into a single case provides the easiest solution to this problem.
There may be up to 20 children per mother. The BY and CARRY
variables are stored only once in the collected case because they
are the same for all children of a given mother. Other variables are
stored 20 times, once for each possible child, with a suffix added
to the variable name so that each will be uniquely addressable. Thus
Sex.1 contains the sex of the first child in the collected case.

SORT requests that a case be stored within the collected case in its
sort order on one or more variables. By specifying a sort on
Birth.Year and using (D) to indicate a downwards sort, the variables
for the youngest child are placed first in the collected case. Sex.1
is, therefore, the sex of the last born. Mother's age is the
difference between Birth.Year.1, the birth year of the last born,
and the mother's birth year.

The COLLECT function did not exist in P-STAT in 1981. However, this
problem can also be solved by using FIRST, LAST, and comparing the
birth date of each child in turn to the value of a scratch variable
which is reset each time a new youngest child is found.

With COLLECT the entire procedure takes three simple programming
language statements. COLLECT and SPLIT make it possible to solve
problems which are otherwise very difficult, even with FIRST, LAST,
and scratch variables.
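
The collect-and-sort idea can be sketched in Python: gather the children of each mother, order them by birth year descending, and read the first one. The field names and the function last_births are assumptions made for the illustration; the sample values are taken from the example records shown for the second data set.

```python
from collections import defaultdict


def last_births(children):
    """Map each mother id to (mother's age at last birth, sex of last born).

    `children` are person records that already carry the mother's
    birth year. Records with no mother id or no mother record are
    skipped.
    """
    collected = defaultdict(list)
    for c in children:
        if c["mother_id"] > 0 and c["m_birth_year"] is not None:
            collected[c["mother_id"]].append(c)
    result = {}
    for mother_id, kids in collected.items():
        kids.sort(key=lambda c: c["birth_year"], reverse=True)
        youngest = kids[0]            # first after the descending sort
        age = youngest["birth_year"] - youngest["m_birth_year"]
        result[mother_id] = (age, youngest["sex"])
    return result


births_demo = last_births([
    {"mother_id": 6, "m_birth_year": 1900, "birth_year": 1929, "sex": 1},
    {"mother_id": 6, "m_birth_year": 1900, "birth_year": 1935, "sex": 2},
    {"mother_id": 8, "m_birth_year": 1910, "birth_year": 1932, "sex": 1},
    {"mother_id": 0, "m_birth_year": None, "birth_year": 1898, "sex": 1},
])
```

Unlike a fixed collection size of 20, the list per mother grows as needed, so the sketch places no limit on the number of children.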


Mother's Age at Last Childbirth
By Sex of Child

                    Sex.Last.Born

Age Last                               Row
Birth              male    female    Total

18 and under          4         2        6
19-25                14        19       33
26-30                25        17       42
31-35                47        85      132

Total N              90       123      213

5.  SIZE  ISSUES 


The  raw  data  files  for  both  data  sets 
contain  many  extra  data  values.  They 
were  included  to  see  if  the  packages 
could  handle  large  numbers  of  variables 
and  to  ascertain  how  much  disk  space  the 
resulting  files  would  need.  P-STAT,  like 
the  other  packages  in  the  test,  had  no 
trouble  handling  the  complete  data  set. 
However, because P-STAT uses a very
aggressive  algorithm  for  packing  data 
values,  the  resulting  P-STAT  system  files 
required  less  disk  storage  than  the  files 
produced  by  the  other  packages. 

6.  CONCLUSIONS 

The session was titled "Benchmarking Vendor Packages". Given that
all the vendor packages could solve the problems, there is some
question about what was actually benchmarked and what the benchmarks
mean. Three areas that can be compared are: 1) ease of use; 2)
speed; and 3) use of resources.

Ease  of  use  is  difficult  to  evaluate. 
The  best  measure,  short  of  a  carefully 
designed  experiment  with  novice  users,  is 
subjective:  how  easy  is  it  to  follow  the 
command  stream  without  reading  the 
explanatory  material.  This  will,  to  a 
large  extent,  depend  on  the  reader's 
background  and  familiarity  with  packages. 
P-STAT  in  1986  has  a  language  that  is 
easier  to  use  than  P-STAT  in  1981.  This 
is  particularly  evident  in  the  COLLECT 
function  and  the  LOOKUP  command.  This  is 
a  trend  that  will  certainly  continue. 

In some of the problems illustrated here, the most readable sequence
of commands is not the fastest. Ease of use and speed are two areas
which should probably not be benchmarked simultaneously. P-STAT's
timings were acceptable, as were those of most of the other
packages.

The resources needed to run a program can be measured. P-STAT on a
PC requires 640K, a math co-processor, and a hard disk. This puts
P-STAT near the high end in terms of PC requirements. However,
P-STAT is modularized so that not all of the program needs to be
installed on the hard disk. In addition, P-STAT's system files
require less disk storage than other packages because of the
aggressive packing algorithms that are used.

The PC is a viable tool for complex data management problems. Data
sets both larger and more complex than these problem data sets can
be handled easily by packages such as P-STAT.


VOLUME  TESTING  WITH  THE  PC/SAS®  SYSTEM 

by Katherine Ng, SAS Institute Inc.


Abstract 

This paper discusses the use of the PC/SAS system to solve the 4
problems posed by Dr. Robert Teitel. The problems are designed
to test the complex data manipulation capabilities of the
statistical/database systems currently available to the PC user.
With the PC/SAS system, the tables requested in the 4 problems can
easily be generated by a SAS tabulation procedure such as FREQ or
TABULATE, after the appropriate data bases have been created.


Problem Descriptions

The first 2 problems use the TRIPS data collection. It has 4
components, which are henceforth referred to as the household, car,
person, and trip records. Each household has a variable number of
cars and a variable number of persons. Each person of a household
took a variable number (0 to 99) of trips. Neither the number of
persons nor the number of cars in a household, nor the number of
trips taken by a person is explicitly coded. The person record
contains the age of the person and his household identification. The
car record has the model year of the car and identifies the
household to which it belongs. The trip record identifies the person
of a household who took the trip, the duration of the trip, and the
car identification code if a family car was used. All records are in
household identification order.

Two  tabulations  are  requested.  The  first  is  a  frequency  distribu¬ 
tion  of  households  for  the  number  of  cars  owned  by  the  number 
of  persons  over  the  age  of  16  in  the  household.  The  second  is  a 
frequency  distribution  of  trips  of  at  least  3  days'  duration  taken  in 
a  car  owned  by  the  household  by  the  age  of  the  person  taking  the 
trip  and  by  the  model  year  of  the  car. 

The last 2 problems use the PEOPLE data. Each record in the
PEOPLE data collection contains a person's identification code,
sex, birth year, level of education, and the parents' identification
codes if known. Records are in identification order.

Again, 2 tabulations are requested. The first is a frequency
distribution of offspring by the education level of each parent, and
the second, a frequency distribution of the "last births" by the age
of the mother at the birth of her last offspring and by the sex of
that last offspring.

The  TRIPS  Data 

With the variability of the number of persons in a household, the
number of cars belonging to a household, and the number of trips
taken by a person of a household, it is best to keep the data in
separate components. SAS system files for the different components
are created for efficient retrieval in later analysis. To assess
PC/SAS ability to handle data for a large number of variables, all
given data are retained in the system files. Working SAS data sets
with only the relevant variables and cases are created to gather
information from the component files. These working data sets are
then passed to a tabulation procedure to generate the desired
tables.

Figure 1 has the SAS statements that build the 4 component files
for the TRIPS data base and extract subsets of data into working
data sets.


/*------------------------------------------*/
/*  Create SAS data sets                    */
/*  HOUSES, CARS, PERSONS, TRIPS            */
/*------------------------------------------*/

DATA a.houses;
infile "houses.dat" lrecl=120;
input house $1-4 type $5
@10 (one1-one51)($char1.)
(two1-two20)($char2.)
(four1-four5)($char4.);
run;

DATA a.cars;
infile "cars.dat" lrecl=120;
input house $1-4 type $5 car 6-7
@10 (one1-one51)($char1.)
(two1-two20)($char2.)
(four1-four4)($char4.)
year 117-120;
run;

DATA a.persons;
infile "persons.dat" lrecl=120;
input house $1-4 type $5 person $6-7
@10 (one1-one51)($char1.)
age 61-62
(two2-two20)($char2.)
(four1-four5)($char4.);
run;

DATA a.trips;
infile "trips.dat" lrecl=120;
input house $1-4 type $5
person $6-7 trip $8-9 car 10
(one2-one51)($char1.)
(two1-two19)($char2.)
days 99-100
(four1-four5)($char4.);
run;

/*------------------------------------------*/
/*  Create temporary data sets:             */
/*  CARS, PERSONS, and TRIPS                */
/*  Extract relevant variables and cases    */
/*------------------------------------------*/

DATA cars;
set a.cars(keep=house car year);
run;

DATA persons;
set a.persons(keep=house person age);
run;

DATA trips(drop=days);
set a.trips(keep=house person car days);
if car and days >= 3 then output;
run;

fig. 1.


A  Solution  To  Problem  1 

Figure  2  shows  how  the  different  components  of  the  TRIPS  data 
base  are  merged  to  create  a  temporary  SAS  data  file  that  contains 
only  the  information  needed  for  table  1.  The  frequency  distribution 
of  the  number  of  households  by  the  number  of  cars  owned  and  by 
the  number  of  persons  over  the  age  of  16  in  the  household  is  easily 
generated  by  PROC  FREQ.  Figure  3  has  the  codes  that  generated 
the  table  shown  in  figure  4.  Note  that  the  data  have  been  classified 
into  a  few  groups  to  produce  a  more  pleasing  and  compact  table. 
In  the  SAS  system,  the  grouping  of  data  is  easily  done  by  defining 
a  format  variable  with  PROC  FORMAT  and  using  this  format  in 
the  appropriate  procedure  for  the  appropriate  variables. 
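
The grouping-by-format idea can be sketched in Python: raw counts are mapped to a small set of display groups, and the cross-tabulation is made on the group labels. The three groups used here (0, 1 - 2, and 3 or more) are the ones that appear in the household tables; the function names count_group and crosstab are hypothetical, and the sketch is not SAS code.

```python
from collections import Counter


def count_group(n):
    """Map a raw count to its display group."""
    if n == 0:
        return "0"
    if n <= 2:
        return "1 - 2"
    return "3+"


def crosstab(households):
    """Cross-tabulate (cars, persons over 16) pairs by display group."""
    return Counter((count_group(cars), count_group(persons))
                   for cars, persons in households)


grouped = crosstab([(0, 0), (1, 2), (2, 1), (5, 3)])
```

Keeping the grouping rule separate from the data means the stored counts are unchanged and a different grouping can be applied later without rebuilding the data set.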


/*------------------------------------------*/
/*  Create temporary data set HHOLDS of     */
/*  household records, with variables       */
/*  ncars  (# of cars in the household)     */
/*  over16 (# of persons over age 16)       */
/*------------------------------------------*/

DATA HHOLDS(keep = ncars over16);
merge a.houses(in=h keep=house)
      persons
      cars;
by house;
length lperson $2;
retain lcar lperson;
ncars + (lcar ^= car);
over16 + ((lperson ^= person)*(age>16));
lcar=car; lperson = person;
if last.house;
if h then output;
ncars = 0; over16 = 0;
lcar = .; lperson = ' ';
run;


fig.  2. 


/*------------------------------------------*/
/*  Set up format for the count variables   */
/*------------------------------------------*/

PROC FORMAT;
value countfmt 0 = "0"
               1-2 = "1 - 2"
               3-high = "3+";
run;

/*------------------------------------------*/
/*  Generate the table for problem 1        */
/*------------------------------------------*/

TITLE3
"Frequency Distribution of Households";
PROC FREQ;
tables ncars*over16/
nocol norow nopercent;
format ncars over16 countfmt.;

fig. 3.


VOLUME TESTING - TRIPS DATA
Frequency Distribution of Households

TABLE OF NCARS BY OVER16

NCARS       OVER16

Frequency|      0|  1 - 2|     3+|  Total
0        |      5|      5|     11|     21
1 - 2    |     10|     15|     17|     42
3+       |      9|     15|     18|     42
Total          24      35      46     105

fig.  4. 


A  Solution  To  Problem  2 

Problem  2  also  requires  that  information  be  gathered  from  across 
the  different  components  of  the  data  base.  For  each  trip,  we  need 
the  model  year  of  the  car  used  in  the  trip  from  the  CARS  file,  and 
the  age  of  the  driver  from  the  PERSONS  file.  With  a  series  of 
sorting  and  merging  steps,  we  can  collate  all  that  information  into 
one  data  file,  which  we  pass  on  to  PROC  FREQ  to  generate  the 
desired  table. 

The PEOPLE Data Base

The last 2 problems need a data manipulation scheme quite different
from the first 2 problems. The solutions to problems 1 and 2
require accessing information across the components of each record
of the data base, whereas solutions to problems 3 and 4 require
collating information from the different records of the same data
file.

Again, a SAS system file is created for the PEOPLE data with all
the variables kept. Retaining the data for all the 75 variables is
not necessary for problems 3 and 4, but it has been done for
completeness. A working SAS data set with only the relevant
variables and cases is created to facilitate multipassing of the
data. See figure 5.


/*........................................*/
/* Create a permanent SAS data set PEOPLE. */
/*........................................*/

DATA a.people; infile "people.dat" lrecl=120;
  input id $1-5 sex $6
        (one1-one50)($char1.) educ 57-58
        (two1-two14)($char2.) birthyr 87-90
        (four1-four5)($char4.)
        mom $111-115 dad $116-120;
run;

/*........................................*/
/* Extract subset of relevant variables   */
/*........................................*/

DATA people;
  set a.people
      (keep=id sex educ birthyr mom dad);




A Solution to Problem 3

Problem 3 asks for a frequency distribution of the number of offspring
by the education level of the parents. We note from the data
description that the only information about the parents contained in
each offspring's record is the parents' identification codes. Thus,
the solution necessitates locating the parents' records (if available)
for each offspring, and collating their education information onto
the offspring's record. Figure 6 shows the steps involved. With the
information about the parents' education collated on the offspring's
record, the table for problem 3 is easily produced by PROC FREQ.


/*........................................*/
/* Extract subsets consisting of male     */
/* records for the FATHER's file and      */
/* female records for MOTHER's file       */
/*........................................*/

DATA father(rename=(id=dad educ=dad_ed)
            keep=id educ)
     mother(rename=(id=mom educ=mom_ed
                    birthyr=momborn)
            keep=id educ birthyr);
  set people;
  if sex = '1' then output mother;
  if sex = '2' then output father;
run;


/* Create the file FAMILY by collating the   */
/* father's record and the mother's record   */
/* into the offspring's record               */
/*...........................................*/

/* Sort PEOPLE by dad's id to match up       */
/* dad's id with his own record from the     */
/* FATHER's file                             */

PROC SORT data=people; by dad; run;

/* father's record merged with child's       */

DATA family(drop=dad);
  merge people(in=child
               keep=sex dad mom birthyr)
        father; by dad;
  if child then output;
run;

/* Sort FAMILY by mom's id to match up       */
/* mom's id with her own record from the     */
/* MOTHER's file. Use child's birthyr as a   */
/* second sort variable                      */

PROC SORT data=family; by mom birthyr; run;

/* mother's record merged with child's       */

DATA family;
  merge family(in=child) mother; by mom;
  if child then output;
run;


fig.  6. 
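
The collation in figure 6 is a self-join: each offspring record looks up two other records of the same file through the parents' identification codes. The Python sketch below illustrates the idea only. It uses a dictionary lookup in place of the SAS sort-and-merge steps, and the function name, the dictionary keys, and the use of `None` for an unavailable parent record are assumptions. Unlike the SAS steps, it does not check the sex of the record a parent code points to.

```python
from collections import Counter


def attach_parent_education(people):
    """Return copies of the records with `mom_ed` and `dad_ed` added.

    people: list of dicts with keys "id", "educ", "mom", "dad" (assumed layout).
    A parent whose own record is not in the file yields None.
    """
    educ_by_id = {p["id"]: p["educ"] for p in people}
    return [
        dict(p, mom_ed=educ_by_id.get(p["mom"]), dad_ed=educ_by_id.get(p["dad"]))
        for p in people
    ]


def education_table(family):
    """Count offspring by (father's education, mother's education)."""
    return Counter((r["dad_ed"], r["mom_ed"]) for r in family)
```

Passing the result of `attach_parent_education` to `education_table` gives the cell counts that PROC FREQ would tabulate, with `None` marking the offspring whose parent records are unavailable.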


A  Solution  to  Problem  4 

The last table is the frequency distribution of the "last births" by
the age of the mother at the birth of her last offspring and by the sex
of that last offspring. The data set FAMILY created in the previous
steps has most of the information needed for problem 4. It remains
to find the record of the youngest child of each mother, and use
the child's birth year and the mother's birth year to compute the
mother's age (figure 7). PROC FREQ can be used to generate the
desired table.

/*........................................*/
/* Create data set which contains         */
/* the records for the youngest children. */
/* Compute mom's age.                     */
/*........................................*/

DATA youngest(keep=sex momage);
  set family; by mom;
  if last.mom then
    if mom ^= " 0" then do;
      momage = birthyr - momborn;
      output;
    end;
run;


fig. 7.
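
The selection of each mother's youngest child can also be sketched without relying on sort order. The Python function below is illustrative only: its name and the dictionary keys are assumptions, it treats a mother code of 0 as "mother unknown" (as the selection in the step above suggests), and it skips records whose mother's birth year is unavailable, which is a choice made for this sketch.

```python
def last_births(family):
    """Return one (sex, mother's age) pair per known mother.

    family: iterable of dicts with keys "mom", "sex", "birthyr", "momborn"
    (assumed layout). The child with the latest birth year is taken as the
    last offspring; on a tie the first record encountered is kept.
    """
    youngest = {}
    for child in family:
        if child["mom"] == 0 or child["momborn"] is None:
            continue  # mother unknown or her record unavailable
        best = youngest.get(child["mom"])
        if best is None or child["birthyr"] > best["birthyr"]:
            youngest[child["mom"]] = child
    return [(c["sex"], c["birthyr"] - c["momborn"]) for c in youngest.values()]
```

The resulting pairs are what the final tabulation of mother's age by sex of last offspring would be built from.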


Timing Estimates

For purposes of comparison and performance evaluation, some timing
estimates for the execution of the different SAS program steps
have been included (figures 8 and 9). All the steps were run
on an IBM PC/AT model 99 with 640k bytes of memory. The
machine used has a numeric coprocessor. Note that a large number
of variables are retained in each data set, hence the DATA step
took a proportionately large amount of time. However, after the
SAS system files have been created, data retrieval is fairly fast. In
fact, the step which gathered all the necessary information for the
first tabulation took only 27 seconds; the actual tabulation took 15
seconds. We also note that sorting is very fast.


TRIPS Data Base

Create
    A.HOUSES                79 vars, 105 obs      47 secs
    A.CARS                  7* vars, 210 obs      77 secs
    A.PERSONS               70 vars, 315 obs     107 secs
    A.TRIPS                 •0 vars, 045 obs     27« secs

Extract
    CARS                     3 vars, 210 obs      11 secs
    PERSONS                  3 vars, 315 obs      20 secs
    TRIPS                    3 vars, 433 obs      23 secs

Solution to Problem 1
  Collate Data
    HOUSES+PERSONS+CARS      2 vars, 105 obs      27 secs
  Tabulate data
    3x3 table                                     15 secs

Solution to Problem 2
  Collate Data
    Sort PERSONS             2 keys, 315 obs      13 secs
    Sort TRIPS               2 keys, 433 obs      14 secs
    TRIPS+PERSONS            2 vars, 433 obs      27 secs
    Sort TRIPAGE             2 keys, 433 obs      13 secs
    Sort CARS                2 keys, 210 obs      12 secs
    TRIPAGE+CARS             2 vars, 433 obs      27 secs
  Tabulate data
    3x3 table                                     22 secs

fig.  8. 
PEOPLE Data Base

Create
    A.PEOPLE                75 vars, 784 obs     215 secs

Extract
    PEOPLE                   6 vars, 784 obs      28 secs
    FATHER                   2 vars, 392 obs      33 secs
    MOTHER                   2 vars, 392 obs

Solution to Problem 3
  Collate Data
    Sort PEOPLE              1 key,  784 obs      17 secs
    PEOPLE+FATHER            4 vars, 784 obs      32 secs
    Sort FAMILY              2 keys, 784 obs      24 secs
    FAMILY+MOTHER            6 vars, 784 obs      34 secs
  Tabulate data
    5x5 table                                     25 secs

Solution to Problem 4
  Extract Data
    YOUNGEST                 2 vars, 213 obs      24 secs
  Tabulate Data
    4x2 table                                     20 secs

Conclusion 

The  PC/SAS  system  is  intended  to  be  a  complete  implementation 
of  the  SAS  system  as  available  to  the  mainframe  and  mini  computer 
users.  It  has  the  same  set  of  data  management  tools,  and  can 
perform complex data analysis with as much ease. In spite of the
limitations of the personal computers and the complexity of the
language, execution is reasonably fast. For small and medium size
problems, we see it as an alternative to the mainframe and mini
systems.

Reference 

SAS  Language  Guide  for  Personal  Computers,  Version  6  Edition, 
SAS  Institute,  Cary,  North  Carolina. 


SAS  is  a  registered  trademark  of  SAS  Institute  Inc,  Cary,  N.C. 
SPSS/PC AND SOLUTIONS FOR THE TEITEL COMPLEX FILE PROBLEMS
Jon K. Peck, SPSS Inc.


1:  Introduction 

This  paper  shows  the  results  for  the  Complex 
File  Problems  1  and  2  that  were  posed  to  various 
microcomputer  statistical  packages.  The  approach 
shown  here  for  SPSS/PC+  is  the  same  as  it  would 
be  for  the  mainframe  SPSS-X  package.  An  SPSS-X 
solution  for  the  same  problem  was  presented  in  an 
earlier  paper  from  the  benchmark  session  at  an 
earlier  Interface.  A  discussion  of  the  solutions 
for  the  two  problems  is  followed  by  a  listing  of 
the scripts used.¹ The output is not shown in
order  to  conserve  space.  The  scripts  were  run  as 
batch  jobs,  but  users  could  also  run  these 
solutions  interactively. 

2:  Parameters  of  the  Solution 

The  required  tables  are  computed  in  SPSS/PC+ 
using  multiple  system  files.  A  system  file  is  a 
binary file that contains both the data and the
variable and value definitions and label information.
By default, these files embody a compression
algorithm that often substantially
reduces  the  file  size  over  the  uncompressed 
version  at  the  cost  of  slightly  longer  processing 
time.  It  is  common  for  the  compressed  file  to  be 
one-third  the  size  of  the  uncompressed  file.  The 
user  can,  however,  instruct  the  system  to  use 
uncompressed  files  instead.  The  user  can  also 
direct  these  files  to  a  RAM  disk  in  order  to 
speed  execution.  SPSS/PC+  will  allocate  the 
maximum possible workspace area in memory unless
instructed  otherwise,  but  the  amount  of  memory 
available  has  no  effect  on  the  problems  discussed 
here  as  long  as  the  system  minimum  requirements 
are  met.  Finally,  the  presence  of  the  optional 
math  coprocessor  can  significantly  reduce  the  run 
time  for  problems  in  SPSS/PC+.  The  output  can  be 
formatted  for  various  page  widths  and  character 
sets  according  to  the  printing  device  to  be  used. 

3:  The  Primary  Commands  Used  in  the  Solution 

The  original  ASCII  data  are  read  with  the  DATA 
LIST  command.  The  other  main  commands  that  are 
used  here  are  GET,  SORT,  JOIN  (MATCH),  AGGREGATE, 
and  CROSSTABS.  GET  selects  a  system  file  as  the 
active  dataset  and  dictionary;  SORT  sorts  cases 
on  up  to  ten  variables;  JOIN  (with  aliases  MATCH 
and  ADD)  performs  an  (outer)  join  on  up  to  five 
files  including  a  table-lookup  facility  or  adds 
new  cases  to  the  file;  AGGREGATE  combines  groups 
of  cases  with  a  choice  of  aggregation  functions 
and missing value treatments, and CROSSTABS produces
n-way tables. All of these facilities are
part of the SPSS/PC+ base system. Lines starting
with * are comments, and lines starting with
DOS  are  DOS  operating  system  commands.  The  DOS 
commands  used  here  simply  delete  superfluous 
files  at  the  end  of  the  run,  but  SPSS/PC+  permits 
any  reasonable  DOS  command  or  program  to  be 
executed  during  a  run. 

4:  Complex  File  Problem  1:  Cars 

First, the four data sets, Households, Cars,
Persons, and Trips, are defined and made into
system files containing all of the variables in
the original data using DATA LIST and SAVE.
Since these system files contain many more variables
than are required to compute the required
tables, superfluous variables are dropped during
subsequent joins and aggregations. Second,
HOUSES, PERSONS, and CARS are joined on the
household id, HOUSE. An outer join where one file
has no matching records produces missing values
for all variables taken from that file. Therefore,
the car id variable, CAR, is recoded to zero if
missing and one otherwise. Thus when the joined
file is aggregated by household, the sum of the
recoded car variable will give the number of cars
owned by the household. An indicator variable is
defined for persons with age over 16. This condition
was interpreted as "age > 16" which means
here age of 17 or more.

Next  the  joined  file  is  aggregated  by  HOUSE 
producing  a  file  in  which  the  unit  of  analysis  is 
the  household  and  which  has  variables  giving  the 
number  of  cars,  NUMCARS,  (zero  or  more)  and  the 
number  of  persons  over  16  (DRIVERS)  using  the  SUM 
function of AGGREGATE. This function treats
missing values as zero unless instructed otherwise,
but when there are no records in the group
with a nonmissing value, the resulting SUM has
the value missing. RECODE is used to designate
these missing cases as zero. Finally, the crosstabulation
of NUMCARS by DRIVERS produces the
first  required  table.  In  these  data  there  are 
households  having  drivers  but  no  cars  and  cars 
but  no  drivers.  The  number  of  households  with 
each  possible  number  of  cars  is  the  same. 
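
The behaviour of SUM described above, and the reason the final RECODE is needed, can be illustrated with a small Python sketch. It is not SPSS code: the function names are hypothetical, and `None` stands in for the system-missing value.

```python
def aggregate_sum(values):
    """Sum a group, ignoring missing entries (None).

    Returns None when the group contains no nonmissing value, mirroring
    the behaviour described in the text for the SUM aggregate function.
    """
    present = [v for v in values if v is not None]
    return sum(present) if present else None


def recode_missing(value, replacement=0):
    """Replace a missing result by `replacement` (zero by default)."""
    return replacement if value is None else value
```

A household with two cars but no person records contributes age indicators that are all missing, so `aggregate_sum([None, None])` returns `None`. Only after `recode_missing` does the household count as having zero persons over 16 and land in the correct cell of the table.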

For  the  second  table,  the  age  of  each  person 
taking  a  trip  is  added  to  the  trip  record  by 
joining  the  TRIPS  system  file  with  the  PERSONS 
system  file  on  household  and  person  ids  treating 
PERSONS  as  a  lookup  table.  After  selecting 
according  to  the  required  conditions  of  trips  of 
at least three days duration and in the household's
own car, the joined file is sorted into
house  and  car  order.  Next,  the  model  year, 
variable  CARYEAR  from  CARS,  is  added  to  the  file 
as  a  table  lookup  based  on  household  id  and  car 
number.  From  this,  the  table  of  AGE  by  CARYEAR 
is  produced  by  CROSSTABS. 
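
The two table lookups in this paragraph can be sketched in Python as follows. This is an illustrative sketch, not the SPSS/PC+ implementation: the function name and the tuple layouts are assumptions, dictionaries replace the sorted table files, and a car number of 0 is taken to mean that the trip was not made in the household's own car (the reading suggested by the selection `car ne 0` in the command script).

```python
def long_trip_pairs(trips, persons, cars, min_days=3):
    """Return (age, model year) pairs for qualifying trips.

    trips:   iterable of (house, person, car, days)   (assumed layout)
    persons: iterable of (house, person, age)         (assumed layout)
    cars:    iterable of (house, car, caryear)        (assumed layout)
    A failed lookup yields None for that value.
    """
    age_of = {(house, person): age for house, person, age in persons}
    year_of = {(house, car): year for house, car, year in cars}
    pairs = []
    for house, person, car, days in trips:
        if days < min_days or car == 0:
            continue  # short trip, or not in the household's own car
        pairs.append((age_of.get((house, person)), year_of.get((house, car))))
    return pairs
```

Each returned pair corresponds to one case entering the AGE by CARYEAR crosstabulation.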

5:  Complex  File  Problem  2:  People 

For  this  problem  two  tables  are  to  be  produced: 
first,  the  number  of  offspring  classified  by 
father's  and  mother's  education,  and  second,  the 
number  of  mothers  classified  by  mother's  age  at 
birth  of  last  offspring  and  sex  of  that  offspring. 
This  problem  has  only  a  single  input  file,  which 
contains  data  on  a  set  of  people  including  both 
offspring  and  parents.  This  is  the  structure  of 
the  classic  employee-manager  database. 

First,  the  dataset  is  defined  and  converted  to 
the  PEOPLE  system  file  using  DATA  LIST  and  SAVE, 
although  constructing  the  required  tables  does  not 
actually  require  this  system  file.  Second,  the 
father's  education  and  the  mother's  education  and 
birth  year  and  the  parents'  id  numbers  are  added 
to the active file. This is accomplished by two
table-lookup joins of the dataset against itself
via SORT and MATCH. The join table is a system
file named PERTAB that contains only the variables
actually needed. Third, the table of offspring
by parents' education is produced by
CROSSTABS. The table is constructed with a separate
row and column tallying the cases where the
parent data are missing.

For the second table, the procedure is as
follows. The cases in the active file from the
previous table exercise are sorted by mother id
and BIRTHYR. Next, for those cases where the
mother is known (MOTHER > 0), the file is aggregated
over MOTHER retaining the last occurrence
of birth year, sex, and mother's birth year
(MABIRYR). From this aggregated file, the mother's
age at birth of last child, BIRTHAGE, is
simply BIRTHYR - MABIRYR. Finally, CROSSTABS
computes the required table.

The  results  of  the  second  exercise  reveal  a 
remarkably  effective  way  to  determine  the  sex  of 
one's final offspring.

6:  Timing  and  File  Size  Data 

The  time  required  to  complete  each  task  and 
the  various  file  sizes  depend  on  the  computer 
configuration  and  the  SPSS/PC+  options  selected. 
The  system  file  sizes  in  K-bytes  were  as  follows. 

Cars         Compressed   Uncompressed
HOUSES           16             69
PERSONS          43            202
CARS             30            136
TRIPS           126            609

People       Compressed   Uncompressed
PEOPLE          119            487
PERTAB           28             44

Running  times  are  reported  for  an  IBM  PC/AT 
with  an  80287  math  coprocessor  and  for  an 
IBM  PC/XT  with  an  8087  math  coprocessor.  For  the 
AT,  times  are  reported  with  the  files  other  than 
the initial ASCII files stored on a RAM disk and
stored  on  a  hard  disk.  On  the  XT  times  are 
reported  for  compressed  and  uncompressed  files 
stored  on  a  hard  disk.  Times  are  reported  in 
minutes.  An  *  means  that  that  case  was  not  run. 

Machine                     PC/AT                   PC/XT
Data Location        RAM Disk   Hard Disk    Hard Disk   Hard Disk
Compression Option      On         On           On          Off

Cars                   9.25       10.5         21.5        20.1
People                 5.5         *           15.9        13.1

It  should  be  emphasized  that  these  times  depend 
on  many  particulars  of  the  machine  and  program 
settings,  which  will  vary.  They  do  suggest, 
however,  that  an  AT  runs  these  problems  about 
twice  as  fast  as  an  XT  and  that  using  a  RAM  disk 
on  an  AT  for  file  storage  has  only  a  modest 
effect. For files such as these, compression
dramatically reduces the system file sizes at a
very modest decrease in execution speed. The XT
to AT speed comparison is likely to be quite
different for machines without a math coprocessor.
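
The ratios behind these remarks can be worked out directly from the two tables above. The short Python sketch below only restates the reported figures; the variable names are illustrative and no new measurements are implied.

```python
# System file sizes in K-bytes as reported: (compressed, uncompressed).
sizes_kb = {
    "HOUSES": (16, 69),
    "PERSONS": (43, 202),
    "CARS": (30, 136),
    "TRIPS": (126, 609),
    "PEOPLE": (119, 487),
    "PERTAB": (28, 44),
}

# Compressed size as a fraction of the uncompressed size.
compression = {name: c / u for name, (c, u) in sizes_kb.items()}

# Cars problem, reported times in minutes, compression on.
xt_vs_at_cars = 21.5 / 10.5       # XT hard disk relative to AT hard disk
ram_vs_hard_cars = 9.25 / 10.5    # AT RAM disk relative to AT hard disk

for name, ratio in compression.items():
    print(f"{name:8s} {ratio:.2f}")
print(f"XT/AT (Cars, hard disk): {xt_vs_at_cars:.2f}")
print(f"RAM/hard disk (Cars, AT): {ram_vs_hard_cars:.2f}")
```

On these figures the full system files compress to between roughly a fifth and a quarter of their uncompressed size, while PERTAB, which holds only a few variables, compresses far less. The XT takes about twice as long as the AT on the Cars problem, and the RAM disk saves about 12 percent of the AT's hard-disk time.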

7:  Command  Script:  Problem  1 

SET more off echo on compress on width wide
    length 60 listing 'cars.lis'.

TITLE 'Household File'.
DATA LIST file='persons.dat'/house 1-4 ha1 to
    ha51 10-60 hb1 to hb20 61-100 hd1
    to hd5 101-120.
SAVE out='houses.sys'.

TITLE 'Persons File'.
DATA LIST file='persons.dat'/house 1-4 person
    6-7 pa1 to pa51 10-60 age pb2 to pb20
    61-100 pc1 to pc5 101-120.
SAVE out='persons.sys'.

TITLE 'Cars File'.
DATA LIST file='cars.dat'/house 1-4 car 6-7
    ca1 to ca51 10-60 cb1 to cb20 61-100
    cd1 to cd4 caryear 101-120.
SAVE out='cars.sys'.

TITLE 'Trips File'.
DATA LIST file='trips.dat'/house 1-4 person
    6-7 trip 8-9 owncar ta2 to ta51
    10-60 tb1 to tb19 days 61-100
    td1 to td5 101-120.
SAVE out='trips.sys'.

TITLE 'Crosstabulate number of cars by number of
    persons over 16'.
SUBTITLE 'Unit of analysis is households'.

* count households by the number of cars owned
  and number of persons over 16.

MATCH file='houses.sys'/keep house/
      file='persons.sys'/keep house age/
      file='cars.sys'/keep house car/by house.
RECODE car (sysmis=0) (else=1).
COMPUTE age16=age > 16.
*AGE16 equals 1 if person over 16.
AGGREGATE outfile=*/break=house/
    numcars 'Number of Cars Owned' =
        sum(car)/
    drivers 'Number of Persons over
        16 years old' = sum(age16).
FORMATS numcars drivers (F2.0).
RECODE numcars drivers (sysmis=0).
SET EJECT ON.
CROSSTABS numcars by drivers.
SET EJECT OFF.


TITLE 'Crosstabulate long trips by age of person
    and car'.
SUBTITLE ' Unit of analysis is Trips'.
MATCH file='trips.sys'/rename (owncar=car) keep
        car house person days/
      table='persons.sys'/keep age house person/
      by house person.
*LOOKUP PERSON'S AGE.
SELECT IF (days ge 3 and car ne 0).
*SELECT TRIPS OF 3 DAYS PLUS.
SORT CASES by house car.
MATCH file=*/keep=age house car/
      table='cars.sys'/keep=caryear house car/
      by house car.
SET EJECT ON.
CROSSTABS age by caryear.
SET EJECT OFF.
DOS erase house.sys.
DOS erase person.sys.
DOS erase cars.sys.
DOS erase trips.sys.
FINISH.

8:  Command  Script:  Problem  2 

SET listing 'people2.lis'
    more off echo on compress on width wide
    length 60 workdev d:.

TITLE 'People File'.
DATA LIST file='c:\teitel\people.dat'/person 1-5
    sex a2 to a51 6-56 educ b2 to b15
    57-86 birthyr d2 to d6 87-110 mother
    father 111-120.
SAVE out='d:people.sys'.
GET file='d:people.sys'/drop a2 to a51 b2 to b15
    d2 to d6.
SAVE file='d:pertab.sys'/drop sex father mother.

TITLE "Count persons by father's and mother's
    education".
SORT CASES by father.
MATCH table='d:pertab.sys'/rename (person=father)
        (educ=paeduc)/ keep=father paeduc/
      file=*/keep person father mother educ
        birthyr sex/ by father.
SORT CASES by mother.
MATCH table='d:pertab.sys'/rename (person=mother)
        (educ=maeduc) (birthyr=mabiryr)/
        keep mother maeduc mabiryr/
      file=*/by mother.
RECODE paeduc maeduc (sysmis=99).
MISSING VALUES paeduc maeduc (99).
SET eject on.
CROSSTAB paeduc by maeduc/options 1.
SET eject off.

TITLE 'Mothers, children, and all that'.
SORT CASES mother birthyr.
PROCESS IF (mother > 0).
AGGREGATE out=*/presorted/break mother/
    birthyr = last(birthyr)/sex=last(sex)/
    mabiryr = last(mabiryr).
COMPUTE birthage = birthyr - mabiryr.
VARIABLE LABELS birthage "Mother's age at last
    birth"/
    sex 'Sex of last offspring'.
VALUE LABELS sex 1 'Male' 2 'Female'.
SET eject on.
CROSSTABS birthage by sex.
SET eject off.
*DOS erase d:people.sys.
*DOS erase d:pertab.sys.
FINISH.


1  The  solutions  were  prepared  by  ViAnn  Beadle 
and  Jon  Peck  based  on  the  earlier  work  of 
Jon  Fry  for  SPSS-X. 
THE CASE OF THE MISSING DATA

Leland  Wilkinson 

University  of  Illinois  at  Chicago  and  SYSTAT,  Inc. 

Grant  Blank 
University  of  Chicago 


The  real  data  in  this  exercise  are  missing.  On 
examining  closely  the  TRIPS  and  PERSONS  data  provided 
by  Robert  F.  Teitel,  we  are  forced  to  conclude  that  his 
data  are  not  real.  The  following  evidence  is  offered 
to  support  our  conclusion. 

Evidence 

A.  Teitel  requests  a  tabulation  of  cars  owned  by  number 
of  persons  over  16.  We  fit  a  log-linear  model  to  this 
table  and  found  a  highly  insignificant  chi-square  for 
an  interaction  hypothesis.  As  anyone  knows,  households 
with  teenagers  have  significantly  more  cars  (including 
those  in  the  body  shop).  The  following  plot  of 
standardized  residuals  to  the  additive  model  shows  no 
conspicuous  deviation  from  normality. 

B.  Teitel  requests  a  table  of  age  of  trip-taking
persons  by  the  model  year  of  their  cars.  It  is  common 
knowledge  that  model  years  comprise  a  simplex  data 
structure  because  people  buy  cars  as  soon  as  they  see  a 
new  one  in  their  neighbors'  driveway.  To  test  this,  we 
computed  gamma  coefficients  on  the  columns  of  the  table 
(model  year)  and  did  a  multidimensional  scaling  on  the 
matrix  of  resulting  coefficients.  The  following  plot 
shows  the  model  years  in  alphabetical  order.  This  plot 
resembles  a  random  walk  more  than  a  one  dimensional 
simplex. 

C.  Teitel  requests  a  table  of  mother's  versus  father's
education.  We  reduced  this  table  to  a  single  count  per 
cell  and  performed  an  influence  scatterplot.  As  can  be 
seen  plainly  in  the  upper  left  hand  corner,  three  of 
the  cells  reduce  the  correlation  by  at  least  .10. 
These  consist  of  highly  educated  women  married  to 
dolts.  This  violates  a  law  of  nature. 

D.  Finally,  Teitel  requests  a  table  of  mother's  age  by
sex  of  last  offspring.  Here  is  the  table. 


TABLE OF BIRTHAGE$ (ROWS) BY SEX$ (COLUMNS)
FREQUENCIES

             Female     Male    TOTAL
0-18              4        2        6
19-25            14       19       33
26-30            25       17       42
31-35            47       85      132
TOTAL            90      123      213

We  did  a  grouped  box  plot  to  check  the  distributions  of 
mothers'  age  at  the  birth  of  their  last  child.  As 
anyone  knows,  last  born  females  should  be  associated 
with  higher  maternal  ages  because  families  continue  to 
reproduce  until  they  have  a  woman  who  can  grow  up  and 
become  a  statistician. 

(c) 1986 Leland Wilkinson and Grant Blank

Conclusion


Several  mainframe  and  microcomputer  database 
packages  offer  the  tools  to  solve  the  volume  testing 
problem  in  this  symposium.  In  fact,  the  database 
packages  which  include  crosstabulation  can  probably 
solve  them  faster  and  with  fewer  commands  than  any  of 
the  statistics  packages  surveyed  here. 

A  good  statistics  package  should  do  more,  however. 
It  should  be  able  to  integrate  database  information 
with  all  of  its  statistical  procedures  without  using 
complex  commands.  Tables  are  the  beginning  of 
statistical  analysis,  not  the  end. 

Because  of  this  simple  distinction,  we  approached 
this  database  problem  differently  from  the  other 
vendors  in  this  symposium.  First,  although  SYSTAT  can 
process character data, we treated the data as double
precision numbers instead of one or two byte
characters. As a consequence, our files are larger and
required more time to convert from the raw data.
Notice, however, that our statistical procedures read
and computed tables on these numerical files as fast as
the others computed tables from character files.


Second,  to  uncover  the  database  fraud,  we  needed 
to  produce  more  than  tables.  Because  of  the  way  people 
process  multivariate  information,  tables  are  one  of  the 
worst  methods  for  displaying  complex  multivariate 
relationships. We chose, instead, a few of SYSTAT's
graphical  displays  which  reveal  at  a  glance  the 
artificiality  of  the  data.  Some  of  these  graphics 
resemble  ones  in  the  other  programs,  but  the  appearance 
is  deceiving.  Only  SYSTAT  offers  multidimensional 
scaling,  generalized  gamma  coefficients,  influence 
plots  and  grouped  box  plots.  Furthermore,  since  SYSTAT 
can  save  tables  into  files  and  treat  them  as  ordinary 
data,  accessing  these  other  procedures  requires  only 
one  command. 

The  following  Figures  1  through  4  provide  the 
SYSTAT  code  for  producing  the  required  tables.  We 
could  have  speeded  up  processing  by  using  programming 
tricks,  but  the  code  in  these  figures  is  more  typical 
of  the  average  SYSTAT  programmer's  approach  to  the 
problems.  We  believe  the  human  processing  time  is  as 
important  as  the  computer  processing  time. 

Figure 1

Read Input Datasets

DATA
NOTE 'Read HHOLDS records.'
SAVE HHOLDS / 'Household dataset' 'ID variable: HHOLD'
GET HHOLDS
LRECL = 120
INPUT (HHOLD RECTYPE1 ONE1(1-51) TWO1(1-20) FOUR1(1-5)),
      (#4 #1 '10 51*#1 20*#2 5*#4)
RUN

NOTE 'Read CARS records.'
SAVE CARS / 'Cars Dataset' 'ID variable: HHOLD CARS'
GET CARS
LRECL = 120
INPUT (HHOLD RECTYPE2 CAR ONE2(1-51) TWO2(1-20) FOUR2(1-4) MYEAR),
      (#4 #1 #2 '10 51*#1 20*#2 4*#4 #4)
RUN

NOTE 'Read PERSONS records.'
SAVE PERSONS / 'Persons Dataset' 'ID variable: HHOLD PERSON'
GET PERSONS
LRECL = 120
INPUT (HHOLD RECTYPE3 PERSON ONE3(1-51) AGE TWO3(1-19) FOUR3(1-5)),
      (#4 #1 #2 '10 51*#1 #2 19*#2 5*#4)
RUN

NOTE 'Read TRIPS records.'
SAVE TRIPS / 'Trips Dataset' 'ID variable: HHOLD PERSON TRIP'
GET TRIPS
LRECL = 120
INPUT (HHOLD RECTYPE4 PERSON TRIP CAR ONE4(1-50) TWO4(1-19) NDAYS FOUR4(1-5)),
      (#4 #1 #2 #2 #1 50*#1 19*#2 #2 5*#4)
RUN

NOTE 'Read PEOPLE records.'
SAVE PEOPLE / 'Sorted by PERSON'
GET PEOPLE
LRECL = 120
INPUT (PERSON SEX ONE(1-50) EDUC TWO(1-14) BIRTHYR FOUR(1-5) MOTHER FATHER),
      (#5 #1 50*#1 #2 14*#2 #4 5*#4 #5 #5)
RUN


Figure  2 

The  TRIPS  Data  Collection 
Producing  Table  1 

DATA 

SAVE  TEMP 

USE  HHOLDS  (HHOLD)  CARS  (RECTYPE2  HHOLD)  /  HHOLD 

BY  HHOLD 

HOLD 

NOTE 'Count number of cars.'

NOTE 'Only increment cars count if household exists in CARS dataset.'

IF BOG THEN LET NCARS = 0
IF RECTYPE2 <> . THEN LET NCARS = NCARS + 1
IF EOG = 0 THEN DELETE
RUN 

NEW 

SAVE  PROB1P1 

USE  TEMP  PERSONS  (RECTYPE3  AGE  HHOLD)  /  HHOLD 

BY  HHOLD 

HOLD 

NOTE 'Count number of persons over age 16.'

IF BOG THEN LET NPERSONS = 0

IF RECTYPE3 <> . AND AGE > 16 THEN LET NPERSONS = NPERSONS + 1
IF EOG = 0 THEN DELETE
RUN 
QUIT 

TABLES 
USE  PROB1P1 

TABULATE  NCARS  *  NPERSONS 
QUIT 


Figure  3 

The  TRIPS  Data  Collection 
Producing  Table  2 

DATA 

SAVE  TEMP 

USE  PERSONS  (HHOLD  PERSON  AGE  RECTYPE3)  TRIPS  (HHOLD  PERSON  CAR  NDAYS)  , 
/  HHOLD  PERSON 

IF NDAYS < 3 OR RECTYPE3 = . OR CAR = 0 THEN DELETE
SORT HHOLD CAR
RUN

SAVE PROB1P2

USE TEMP CARS (HHOLD CAR RECTYPE2 MYEAR) / HHOLD CAR
IF RECTYPE2 = . OR RECTYPE3 = . THEN DELETE
RUN
QUIT

TABLES
USE PROB1P2

TABULATE AGE * MYEAR
QUIT 

Figure 4

The PERSONS Data Collection
Producing Table 3

DATA
NOTE 'Dataset of mothers sorted by MOTHER.'
NOTE 'It is crucial that FATHER be included in this dataset.'
SAVE MOTHER / 'Sorted by MOTHER'
USE PEOPLE (MOTHER FATHER)
NOTE 'Delete cases with no mother.'
IF MOTHER = 0 THEN DELETE
LET INCLUDE$ = 'YES'
SORT MOTHER
RUN

NOTE 'Dataset with mothers'
SAVE PEOPLE2
USE PEOPLE (PERSON EDUC)
NOTE 'Rename PERSON to MOTHER and EDUC to MOTHED'
LET MOTHER = PERSON
LET MOTHED = EDUC
DROP PERSON EDUC
RUN

NOTE 'Dataset of mothers education joined to offspring record.'
NOTE 'Sorted by FATHER.'
SAVE MOTHER2
USE MOTHER PEOPLE2 / MOTHER
IF INCLUDE$ <> 'YES' THEN DELETE
SORT FATHER
RUN

NOTE 'Dataset with fathers.'
SAVE PEOPLE3
USE PEOPLE (PERSON EDUC)
NOTE 'Rename PERSON to FATHER and EDUC to FATHED'
LET FATHER = PERSON
LET FATHED = EDUC
DROP PERSON EDUC
RUN

NOTE 'Add fathers education to dataset.'
NOTE 'Dataset contains both fathers and mothers education, sorted'
NOTE ' by FATHER.'
SAVE PROB2P1
USE MOTHER2 PEOPLE3 / FATHER
IF INCLUDE$ <> 'YES' THEN DELETE
DROP INCLUDE$
RUN
QUIT

TABLES
USE PROB2P1
TABULATE MOTHED * FATHED
QUIT


The  PERSONS  Data  Collection 
Producing  Table  4 

DATA 

SAVE  MOTHER 

USE  PEOPLE  (BIRTHYR  PERSON) 

LET MOTHER = PERSON
LET MBIRTHYR = BIRTHYR
DROP BIRTHYR PERSON
RUN

SAVE MOTHER2
USE PEOPLE (MOTHER BIRTHYR SEX)
SORT MOTHER
RUN

SAVE MOTHER3
USE MOTHER2 MOTHER / MOTHER
IF MBIRTHYR = . OR BIRTHYR = . THEN DELETE
LET BIRTHAGE = BIRTHYR - MBIRTHYR
SORT MOTHER BIRTHAGE
RUN

SAVE PROB2P2
USE MOTHER3
BY MOTHER
IF EOG = 0 THEN DELETE
LABEL SEX / 1 = 'Female' 2 = 'Male'
IF BIRTHAGE<19 THEN LET BIRTHAGE$='0-18'
IF BIRTHAGE>18 AND BIRTHAGE<26 THEN LET BIRTHAGE$='19-25'
IF BIRTHAGE>25 AND BIRTHAGE<31 THEN LET BIRTHAGE$='26-30'
IF BIRTHAGE>30 AND BIRTHAGE<36 THEN LET BIRTHAGE$='31-35'
RUN

TABLES
USE PROB2P2

TABULATE BIRTHAGE$ * SEX$

QUIT 


STATISTICAL  DATABASE  MANAGEMENT  ON  MICROCOMPUTERS:  THE  BENCHMARK  RESULTS 


Robert F. Teitel

TEITEL DATA SYSTEMS
Bethesda, MD 20814

This paper presents the results of executing the solutions to a set of
data manipulation problems supplied by vendors of microcomputer-based
statistical systems on a common microcomputer. The description of the
data files and the data manipulation problems are found in a companion
paper, "... The Benchmark Problems", elsewhere in these proceedings.
The description of each vendor's solutions to the benchmark problems are
also found elsewhere in these proceedings.


PARTICIPATING  VENDORS 


BMDP      BMDP Software; Hit Westwood Drive; Los Angeles, CA V0023

DASY      Statistical Software Resources; 203SS Seaboard Avenue;
          Malibu, CA 101(1

PRODAS    Conceptual Software; P.O. Box 51427; Houston, TX 77245

PSTAT     PSTAT Inc.; 471 Wall Street; Research Park; Princeton, NJ 08540

SAS/PC    SAS Institute; P.O. Box 8000; Cary, NC 27511

SPSS/PC   SPSS Inc.; 444 North Michigan Avenue; Chicago, IL 60611

SYSTAT    SYSTAT Inc.; 2*02 Central Street; Evanston, IL 60202


I. INTRODUCTION

The performance figures to be presented below -- for database creation
time and database size, and execution of the benchmark data manipulation
problems -- are based on the batch-oriented job streams, or "scripts",
submitted by the vendors. Each script was executed just once, and
performance was monitored with a stopwatch. The indicated total time
should be correct to within 5 or 10 seconds. However, it has been our
experience in other microcomputer timings that were these benchmarks to
be rerun -- as we intend to do -- the numbers could vary by as much as
13 or 20%. Much of the variation will be due to the varying distribution
of the data and program files involved in each test on the (usually)
single large ("hard", "fixed") disk drive on most microcomputer systems.

The equipment used for the performance tests consisted of an IBM PC/XT
with an 8087 Numeric Data Coprocessor (NDP) and a 30 MB disk of unknown
brand with about 10MB of free space for our use.

After studying the various scripts, published elsewhere in these
proceedings, and the timing figures below, I submit the following
conclusion can be drawn: the variation among the vendors is negligible
compared to the time it would take a typical user to prepare the job
streams.


II. SCRIPT DISSIMILARITIES

There are a number of known dissimilarities among the scripts from each
vendor which would impact performance. These include the following. The
SYSTAT database sizes could have been reduced to half that shown, with
apparently little effect on performance. The PSTAT scripts for the
creation of the two databases assumed the input data files would come
from a floppy disk; all other vendors assumed that the data files were
on the large disk. The BMDP scripts do not save the created databases;
thus no database sizes are reported.


The PSTAT timings include printing of the final tabulations because
those scripts were the only ones to print just the final tabulations;
all the other scripts required either that everything or nothing be
printed. Since the amount of printing might have materially affected the
execution timings, the non-PSTAT scripts were executed without printing
anything.

The PRODAS scripts for the second (PEOPLE) set of problems assumed space
existed for an in-core list of referenced record keys (as there was for
the data distributed). The BMDP scripts for these problems explicitly
sorted the referenced record keys, whereas "parents" (for the first
tabulation) and "mothers" (for the second tabulation) always occurred
before the applicable child record in the distributed data, a fact which
could be used to eliminate the sorts.
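The remark about record order can be illustrated with a short sketch. The Python fragment below is only an illustration of the idea, not any vendor's method. It assumes records of the form (person, mother, father, education) in which every parent precedes the children, and it keeps an in-core table of every key seen so far rather than only the referenced keys. The record layout and the names `RECORDS` and `attach_parent_education` are hypothetical.

```python
# Hypothetical records: (person, mother, father, years of education).
# None marks a parent who is not in the file; all values are invented.
RECORDS = [
    (1, None, None, 12),
    (2, None, None, 16),
    (3, 1, 2, 14),
    (4, 1, 2, 12),
]


def attach_parent_education(records):
    """Pair mother's and father's education for each child in a single pass.

    Relies on each parent record occurring before its children, so that
    no sort or second pass over the file is needed.
    """
    educ_by_person = {}  # in-core table keyed by record key
    pairs = []
    for person, mother, father, educ in records:
        educ_by_person[person] = educ
        if mother in educ_by_person and father in educ_by_person:
            pairs.append((educ_by_person[mother], educ_by_person[father]))
    return pairs


if __name__ == "__main__":
    print(attach_parent_education(RECORDS))
```

If a child record preceded one of its parents, this single pass would silently omit it, which is why the ordering of the distributed data matters.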

The DASY scripts were not presented at the benchmarking session.
Moreover, DASY doesn't have a concept of a "systems file" or a
"database": it selects from input files only those data fields necessary
for a particular task. PRODAS submitted a similar set of scripts, in
addition to those which used a database constructed from the full set of
distributed data. These two "speedy" results are presented together
below, and represent likely similar results for the other systems
executing in this fashion.

Finally, the vendors selected one of two basic script organizations.
Since there are two tabulations requested from each database, some
scripts perform some common procedures for the tabulations (even if only
creating a subset database containing only the variables necessary for
the two tabulations). Other scripts assume that the tabulations are
totally independent, and proceed to produce them separately. Clearly,
performance results are not comparable across such scripts, and are
therefore presented separately below. (Note that the vendors were not
necessarily consistent, using one script organization for the TRIPS data
tabulations, the other for the PEOPLE data tabulations.)

Other dissimilarities will, no doubt, be discovered. In addition, expert
users of these systems might find clever tricks to improve the
performance. The scripts were executed as provided by the vendors, and
represent (we hope) scripts which a typical user might have constructed.


The following table displays the number of records, raw ASCII size,
database size by component, and time to load the database by vendor for
the TRIPS data collection.

The following table displays the processing time required for each TRIPS
data collection script by script organization by vendor.


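      +---------------------------+
      | hholds/cars/persons/trips |
      +---------------------------+
                    |
               +---------+
               | process |
               +---------+
                 /     \
          +-------+   +-------+
          | tab 1 |   | tab 2 |
          +-------+   +-------+


      +---------------------------+
      | hholds/cars/persons/trips |
      +---------------------------+
            |               |
       +---------+     +---------+
       | process |     | process |
       +---------+     +---------+
            |               |
        +-------+       +-------+
        | tab 1 |       | tab 2 |
        +-------+       +-------+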

Execution time
mins:secs


Execution time

total mins:secs mins:secs


BMDP

PRODAS

PSTAT

SAS/PC

SPSS/PC

SYSTAT


13:30 
0  :  43s 


20  :  00« 
4:00s 


10  :  30s 
10:30s 


13:43 
4  :  30 


12:43 
7  :  43 


(Only PSTAT provided table-only printing; the others were run with
display on screen only (that is, without any printing).)


PRODAS

DASY 


3:00s 

4:30s 


(The above two results are for loading only the variables needed in
order to perform the requested tabulations.)


The following table displays the number of records, raw ASCII size,
database size by component, and time to load the database by vendor for
the PEOPLE data collection.

The following table displays the processing time required for each
PEOPLE data collection script by script organization by vendor.


            Execution time      Execution time
            mins:secs           total      mins:secs   mins:secs

BMDP        13:30+
PRODAS       3:43+
PSTAT       10:30
SAS/PC       0:13+
SPSS/PC                         14:30+      4:30       10:00
SYSTAT                          14:00+      4:00       10:00

(Only PSTAT provided table-only printing; the others were run with
display on screen only (that is, without any printing).)

PRODAS       3:13+
DASY                             3:43+      3:13        2:30

(The above two results are for loading only the variables needed in
order to perform the requested tabulations.)

V. SUMMARY

If one is given the data collection descriptions and the tabulation problem
definitions, published elsewhere in these proceedings, and one studies the various
scripts developed by the vendors of microcomputer statistical software to perform the
necessary data manipulation for the tabulations, also published elsewhere in these
proceedings, and the timing figures presented above, I submit the following
conclusion can be drawn: the variation among the execution times of the scripts from
the various vendors is negligible compared to the time it would take a typical user
to prepare the job streams.


Clearly this work could not have been done without the cooperation of the vendors
involved, and their efforts are sincerely appreciated. In addition, the author wishes
to express his deep gratitude to Thomas Boardman and Jim zumBrunnen and their
staff at the Statistics Laboratory of Colorado State University for providing the
equipment on which the benchmarks were executed, and on which drafts of the "...
PROBLEMS" and the "... RESULTS" papers were prepared for distribution at the
Interface Symposium.


See the companion paper, "... The Benchmark Problems," elsewhere in these
proceedings.

See also the accompanying papers produced by the participating vendors on their
solutions to the benchmark problems elsewhere in these proceedings.


A STATISTICAL LANGUAGE FOR MICROCOMPUTERS


David  M.  Allen,  University  of  Kentucky 


1. INTRODUCTION

A programming language is a systematic notation by which one instructs
the computer to do a task. A statistical language is a programming
language which makes it easier to instruct a computer to do a
statistical analysis. It would have predefined data types, default
values for certain variables, and understood objects in certain
expressions. It would have built-in procedures for application of the
most frequently used statistical software.

There is presently a large amount of statistical software available.
This goes by a variety of names including programs, packages, systems,
and languages. Very few would satisfy a formal definition of a language,
and all are well short of ideal. There are many considerations in the
design of a statistical language, and many tradeoffs which must be
examined. The language envisioned here can handle all aspects of data
analysis that involve the use of the computer: data entry and
management, editing, application of statistical techniques, and report
generation.

Once the language design is specified it is necessary to write a
translator or compiler that will change statements written in the
language into object code (machine instructions). The efficiency of the
data analysis depends on the language design, the quality of the
translator, and the quality of object code produced by the translator.
Design criteria for a good statistical language are very similar to the
design criteria for a good general purpose computer language. Horowitz
(1984) gives the following list of criteria for the design of general
purpose computer languages: reliability, fast translation,
extensibility, well defined syntactic and semantic description,
efficient object code, orthogonality, machine independence, provability,
generality, consistency with commonly used notations, subsets, and
uniformity. Many of the items in this list are interrelated. The next
three sections concentrate on the first three items in this list, but
several others will be discussed in passing. A subsequent section will
discuss relevant considerations in developing a language for a
microcomputer as opposed to a large mainframe computer. The paper
concludes with a summary of desired features in a statistical language.
Implementation of the features would represent a considerable
advancement from existing statistical software.

2. RELIABILITY

In both general purpose computer languages and statistical languages, a
sequence of statements designed to instruct a computer to carry out a
task is called a program. Programs written in a language should be
reliable. A number of factors contribute to reliability. A statistical
language should be similar to one's native language. Freedom to place
comments within the code also enhances readability and reliability of
programs.

Some programming languages require that variables be declared before use
and others do not. Declaration of variables makes programs more reliable
and efficient. The FORTRAN statements

HORSE = A+B
HOUSE = C+HORSE

are perfectly legal. However, if HOUSE was intended to be HORSE, a
programmer may have trouble finding why the program is not working as it
should. The Pascal programmer is required to put a statement like

VAR HORSE: INTEGER;

near the beginning of his program. Any occurrence of HOUSE is
immediately detected by the compiler.

The SAS statistical system tries to find out as much as it can about
data and data structures to relieve the programmer of the burden of
providing declarations. One convention is for a character variable to
take its length attribute from its first use. Consider the SAS job

DATA DUMMY;
INPUT X;
IF X>3 THEN Y='BIG';
ELSE Y='SMALL';
CARDS;
5
4
3
2
1
PROC PRINT;

The output will have "SMA" where "SMALL" was intended. Of course we can
rearrange things to get what we want. However, it seems easier to always
declare variables than to be always mindful of quirks in a language.
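The truncation in the SAS job above can be imitated in a few lines. The following Python sketch is not SAS and makes no claim about how SAS is implemented; it only models the stated convention that the length attribute is taken from the first assignment appearing in the program text. The function names `first_use_length` and `data_step` are hypothetical.

```python
def first_use_length(assigned_literals):
    """Length attribute taken from the first assignment in the program text."""
    return len(assigned_literals[0])


def data_step(xs, big="BIG", small="SMALL"):
    """Assign a label to each x, truncating to the width fixed by first use."""
    width = first_use_length([big, small])  # 'BIG' appears first, so width is 3
    rows = []
    for x in xs:
        y = big if x > 3 else small
        rows.append((x, y[:width]))  # longer values are silently truncated
    return rows


if __name__ == "__main__":
    for row in data_step([5, 4, 3, 2, 1]):
        print(row)
```

Swapping the two assignments so that 'SMALL' comes first gives a width of 5 and no truncation, which is the rearrangement the text alludes to.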

The following example illustrates how the combination of an additional
data type and declaration of variables could improve reliability, ease
of use, and execution speed. Consider the SAS job

DATA STUDENT;
INPUT CLASS $ Y;
CARDS;
FRESH 2.4
FRESH 4.0

SOPH 3.2
SOPH 2.3

GRAD 2.5
CRAD 3.2
PROC GLM;
CLASSES CLASS;
MODEL Y=CLASS;
ESTIMATE CLASS 1 -1 0 0 0;

Suppose the number of data lines is large enough that the cost of
running the job is not trivial and CRAD was intended to be GRAD. SAS
would not detect the error and the results would not be usable. The
ESTIMATE statement would compare a single graduate student to the
average of the freshmen. SAS does not have a qualitative data type, but
determines the information it needs for each application with the
CLASSES statement. The CLASSES statement requires two passes through the
data: one to determine what levels are present, and another to assign
the integer codes.

Suppose we introduce a data type like Pascal's enumerated data type
except that the string representation can be input and output directly.
Near the beginning of the program would be the statements

TYPE CLASSTYPE = (FRESH, SOPH, JUNIOR, SENIOR, GRAD);
VAR CLASS: CLASSTYPE;

The character variable indicator and the CLASSES statement would be
removed. On input, any value that did not match a value in CLASSTYPE
would be detected as an error. The integer representations of the levels
would be determined on input and made a permanent part of the data set.
The data set would require less disk storage. The large time requirement
of the CLASSES statement is eliminated. The coefficients in the ESTIMATE
statement would now apply to the natural order as specified in the TYPE
statement, rather than the obscure order associated with the character
set's collating sequence.
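The behaviour described for the proposed data type can be sketched with Python's standard `enum` module. This is only an analogy to the Pascal-like TYPE statement in the text, not the proposed language itself; the class name `ClassType` and the helper `read_class` are hypothetical.

```python
from enum import IntEnum


class ClassType(IntEnum):
    """Levels in their natural order, as a TYPE statement would declare them."""
    FRESH = 1
    SOPH = 2
    JUNIOR = 3
    SENIOR = 4
    GRAD = 5


def read_class(text):
    """Convert an input string to a level, rejecting anything undeclared."""
    try:
        return ClassType[text.strip().upper()]
    except KeyError:
        raise ValueError(f"{text!r} is not a level of ClassType") from None


if __name__ == "__main__":
    print(int(read_class("GRAD")))            # integer code stored with the data
    print([level.name for level in ClassType])  # declared order
    print(sorted(level.name for level in ClassType))  # collating-sequence order
    try:
        read_class("CRAD")
    except ValueError as err:
        print("input error:", err)
```

The mistyped value is rejected at input time rather than becoming a sixth level, and the declared order differs from the alphabetical order that a character-based CLASSES statement would impose.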

The language should be such that typing errors are likely to be detected
by the translator. Consider the FORTRAN statements

DO 10 I = 1.5
A(I) = X + B(I)
10 CONTINUE

The 1.5 was intended to be 1,5. However, FORTRAN does not recognize
blanks, and the statement intended to be a DO statement is actually the
assignment statement DO10I = 1.5. Also, since FORTRAN does not require
that variables be declared, the compiler did not detect an error. This
example is from Horowitz (1984).
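The effect of ignoring blanks can be made concrete with a toy classifier. The Python sketch below is not a FORTRAN parser; it assumes a much simplified grammar in which a DO statement is recognized only by the comma between its initial and terminal values, and the function name `classify` is hypothetical.

```python
import re


def classify(statement):
    """Classify a simplified FORTRAN statement after removing all blanks."""
    text = statement.replace(" ", "")
    # A DO statement needs a comma between the initial and terminal values.
    if re.fullmatch(r"DO\d+[A-Z][A-Z0-9]*=[^,]+,.+", text):
        return "DO loop"
    # Otherwise a name followed by '=' is an ordinary assignment.
    if re.fullmatch(r"[A-Z][A-Z0-9]*=.+", text):
        return "assignment to " + text.split("=")[0]
    return "unrecognized"


if __name__ == "__main__":
    print(classify("DO 10 I = 1,5"))
    print(classify("DO 10 I = 1.5"))
```

With the comma the statement is a loop; with the period the same characters form an assignment to a variable named DO10I, and without mandatory declarations nothing flags the new name.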

3. FAST TRANSLATION

Translation  of  a  program  to  object  code  is 
part  of  every  analysis.  Generally,  the  speed  of 
translation  is  very  important.  We  will  discuss 
two  of  a  large  number  of  factors  that  affect 
speed  of  translation. 

One  factor  is  the  syntax  and  semantics  of  the 
language.  Semantics  deals  with  the  meanings  of 
sentences.  Syntax  refers  to  the  rules  by  which 
words  are  put  together  to  form  phrases,  clauses, 
and  sentences.  The  syntax  should  be  such  that 
backing  up  is  minimized.  Consider  the  FORTRAN 
statement 

IF  (5.EQ.MAX)  GOTO  100 

In  FORTRAN,  5.E-2  is  a  legal  number,  hence  the 
compiler  reaches  the  "Q"  before  it  knows  the 
proper  interpretation  of  the  string  of  characters 
starting  with  the  "5". 
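The backing up described above can be seen in a small scanner. The Python sketch below handles only a hypothetical subset of FORTRAN (unsigned numbers and six dotted relational operators) and is meant to show the ambiguity, not to reproduce any compiler. The names `REAL`, `DOT_OPERATOR` and `leading_number` are invented for the example.

```python
import re

# Real constant with an optional exponent part, e.g. 5.E-2 or 3.14
REAL = re.compile(r"\d+\.\d*(?:E[+-]?\d+)?")
# Hypothetical subset of the dotted relational operators
DOT_OPERATOR = re.compile(r"\.(?:EQ|NE|LT|LE|GT|GE)\.")


def leading_number(text):
    """Return the numeric token that starts text."""
    digits = re.match(r"\d+", text)
    if digits is None:
        raise ValueError("text does not start with a digit")
    # The characters '.E' could begin an exponent or an operator such as
    # .EQ., so the scanner must look past them before committing.
    if DOT_OPERATOR.match(text, digits.end()):
        return digits.group(0)
    real = REAL.match(text)
    return real.group(0) if real else digits.group(0)


if __name__ == "__main__":
    print(leading_number("5.EQ.MAX"))
    print(leading_number("5.E-2"))
```

Both inputs share the prefix "5.E", so the decision between an integer followed by an operator and a real constant cannot be made until the fourth character has been examined.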

In  FORTRAN,  the  expression  DOG(I)  could  be 
either  a  function  or  an  element  of  an  array.  In 
Pascal,  DOG(I)  is  a  function  and  DOG[I]  is  an 
element  of  an  array.  Pascal's  way  of  doing 
things involves less table searching and compiles
more  quickly.  In  addition,  programs  are  more 
reliable  and  easier  for  the  human  readers. 


Some  compilers  make  multiple  passes  over  the 
program.  This  allows  the  programmer  a  certain 
amount  of  freedom  from  attention  to  detail,  but 
requires  more  time  to  compile.  Pascal  programmers 
must  place  certain  things  in  a  specified  order, 
but  their  programs  can  be  compiled  in  a  single 
pass  with  a  one  character  look  ahead. 

Another factor affecting translation speed is the language in which the
translator is written. Traditionally, compilers and translators were
written in assembly language to make the best use of the higher speed
portions of the computer and to reduce redundant instructions. Recently,
the C programming language has become popular for systems work and
compiler writing. The use of C gives a high degree of machine
independence. Computer scientists are currently debating the relative
efficiency of the object code generated by assembly language and C
(Kernighan and Ritchie, 1978). The editor of DTACK GROUNDED (1985) cites
examples of drastic performance reductions of recent versions of
commercial software written in C. These observations regard the
currently popular class of microcomputers. Some study and
experimentation needs to be done to resolve this issue.

4.  EXTENSIBILITY 

Extensibility is the ability to define additional operators and objects
in the language. The language developer cannot include all known
statistical techniques as operators in the language. The user should
have the ability to add to the language the additional techniques he
needs. When making extensibility part of the language, the developer
must make some assumptions about the programming capabilities of the
user. Generally speaking, a user wishing to make worthwhile extensions
should be able to program in a higher level language like FORTRAN or
Pascal. He should not, however, be expected to know a lower level
language like assembly or C. This would severely limit the number of
people able to write extensions.

The statistical systems SAS (SAS Institute, 1985) and S (Becker and
Chambers, 1984) provide forms of extensibility. Both have a well defined
grammar for an interface language. However, the technical knowledge
required of the individual implementing the extensions is more than we
feel necessary. The Forth programming language is extended by entering
the extensions and typing FREEZE. Turbo Pascal and UCSD Pascal allow for
automatic overlays. These software products demonstrate that
extensibility need not be difficult for the user.

5.  SPECIAL  CONSIDERATIONS  FOR  MICROCOMPUTERS 

Currently, the most popular class of microcomputers has an Intel 8088
processor, the MS-DOS or similar operating system, about 256 kilobytes
of random access memory, approximately 720 kilobytes disk storage, and
some graphics capability. The statistical language proposed here could
certainly be implemented on this class of computer. These computers are
slow, and thus a good batch facility is needed to provide for unattended
operation. A procedure to check the syntax of programs and for the
presence of requested data sets and devices would remove most of the
causes of unsuccessful runs. An overlaying capability is needed to deal
with the limited amount of random access memory. In situations where
disk space is limited, the user should be able to remove procedures he
never uses.

While the language could be designed to operate on the computers
described above, it seems more reasonable to design for the popular
computer of three years from now. Today's leading edge microcomputers
like the IBM PC/AT or AT&T Unix PC would seem to be logical choices as
target machines.

The capabilities of microcomputers and mainframe computers are both
increasing rapidly with time. There will be a point when a microcomputer
programmer will not have to design around speed and memory limitations.
Operating systems will provide more support to make language
implementation easier. However, it is likely that a large proportion of
microcomputer users will continue to be isolated from the technical
support received by many mainframe users. Hence, the language must be
well documented and easy to use.

6.  FEATURES  OF  THE  ENVISIONED  STATISTICAL  LANGUAGE 

In view of the previous discussion, a number of features are clearly
desirable in a statistical language. These features are restated here in
summary form.

The  statistical  language  would,  to  the  extent 
possible,  satisfy  the  stated  criteria  for  language 
design.  Note,  however,  that  a  gain  in  machine 
independence  is  likely  to  result  in  a  decrease  in 
efficiency  of  object  code.  Other  compromises  may 
also  be  necessary. 

The syntax of the language would be very similar to that of Pascal.
Pascal compiles rapidly in one pass and is relatively free of quirks.
Pascal is presently taught in high schools. Hundreds of thousands of
copies of a Pascal compiler for microcomputers have been sold. Thus,
many people already know Pascal and could learn the statistical language
with little effort.

Data types of all variables must be declared. This will make programs
more reliable and faster in translation. The data type of variables in
data sets would be stored in the data set and hence repetitive
declarations would not be required. A data type similar to Pascal's
enumerated data type would be present. This would make analysis of data
with qualitative or categorical variables more reliable and more
efficient.

Some of the ways in which the proposed language would differ from Pascal
follow. Input/output facilities for enumeration data types would be
implemented. The file handling capabilities would be extended to allow
indexed files. Techniques for parameter passing to procedures would be
modified along the lines of Ada or S to allow for default values and
more ease of use.

Extending the language would be as easy as writing a Pascal-type program
and telling the language to attach the program to itself. Nearly all of
the procedures and functions in the language would be available to the
person writing the extension.

The language interacts with the user on at least two levels. The
extension programming level would be similar to the Turbo Pascal
environment where one can go quickly back and forth between the editor
and compiler. The statistical analysis level would have to be command
driven to be consistent with the batch capability. However, an
interactive shell would be available to assist beginning users.


MATRIX LANGUAGES FOR STATISTICS


Kenneth  N.  Berk 


1. Introduction

Packaged statistical programs are useful, but there is much that they do
not do. Languages such as FORTRAN, PASCAL, and C can be used as an
alternative, but much effort is required to use them. It is easier to
program in a language with facilities for manipulation of matrices and
building blocks for statistics such as probability distributions, random
number generation, plotting, ranking, and functions for linear model
analysis.

This paper discusses seven matrix languages - APL combined with
STATGRAPHICS, GAUSS, SAS IML and MATRIX, MATLAB, S, and SPEAKEASY -
which are listed along with the vendor addresses in Figure 1. Tables 1
through 12 indicate the presence of various features in these languages.
Section 2 explains these features and Sections 3 through 8 discuss the
languages individually. Finally the last section gives a summary along
with a discussion of some desirable features for future implementation.

Note that the author has not used APL, IML, or S. Their importance is
such that it was thought best to include them, based on documentation
and other sources. The author is indebted to W. Gerald Platt of San
Francisco State for the APL benchmarks and Alan Eaton of SAS Institute
for the IML benchmarks.

2.  The  Tables 

Perhaps Table 1 is self-explanatory, except that "PROTECTED" refers to
copy protection. In Table 2, it is indicated that all of the languages
are interpreted. Here, "DIARY FILE" refers to the ability to store the
results of an interactive session for later editing and, possibly,
execution as a batch file. "PRINT TOGGLE" is the ability to turn on and
off the automatic output of all assignment statements. This is a
valuable aid in debugging.

In  Table  3  "VECTOR"  refers  to  the  ability  to 
reference  one-dimensional  structures  by  one 
subscript.  Those  without  this  feature  require 
two  subscripts. 

Each of the languages has some sort of subroutine or macro structure, as
indicated in the last column of Table 5. Here, "LOCAL" refers to the
variables being local to the structure. "COMPILE" means that subroutines
can be precompiled.

In Table 7 an "INDEX VECTOR" consists of the integers from 1 to N. The
"SUBMATRIX" extracted consists of the first three rows.

In Table 8 the abbreviations stand for Choleski, eigenvalues and
eigenvectors, singular value decomposition, generalized inverse, QR
decomposition, Gram-Schmidt decomposition, and the fast Fourier
transform.

The reason for having the "KRONECKER PRODUCT" in Table 9 is that it can
be used for creating interaction terms in the analysis of variance. The
entries for STATGRAPHICS and SAS here in parentheses show alternatives
which are easier to use for the same purpose.
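As an illustration of how a Kronecker product can generate analysis of variance terms, the following Python sketch builds indicator columns for a balanced 2 x 2 layout with one observation per cell. It is a minimal example under those assumptions and does not correspond to the syntax of any of the languages reviewed; the names `kronecker`, `I2` and `ONES` are hypothetical.

```python
def kronecker(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


I2 = [[1, 0], [0, 1]]   # indicator columns for a two-level factor
ONES = [[1], [1]]       # column of ones

# Cells are ordered (a1,b1), (a1,b2), (a2,b1), (a2,b2).
factor_a = kronecker(I2, ONES)     # which level of A each cell has
factor_b = kronecker(ONES, I2)     # which level of B each cell has
interaction = kronecker(I2, I2)    # one indicator column per A-by-B cell

if __name__ == "__main__":
    for name, matrix in (("A", factor_a), ("B", factor_b), ("A*B", interaction)):
        print(name)
        for row in matrix:
            print("  ", row)
```

The interaction block is obtained from the two main-effect indicator matrices by a single product, which is the convenience the table entry refers to.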

The "VERSATILE" entry in Table 11 is intended
to  indicate  powerful  features  in  the  given 
category.  "CG"  refers  to  the  color  graphics 
card. 

The benchmarks in Table 12 were obtained with the 8088/8087 (IBM PC and
clones), except that for IML they were obtained with the 80286/80287
(IBM AT).

3. APL / STATGRAPHICS

Created by Kenneth Iverson in the late 50's, APL is the oldest of the
languages considered here. It has its own character set and to outsiders
the programs may be difficult to understand. This is in contrast to the
other matrix languages, all of which resemble ordinary matrix algebra.
Nevertheless, APL has a strong following, and the APL conferences are
well-attended. It seems to be the only matrix language available on the
Macintosh (but MATLAB will be available soon) and there are a number of
versions available for PC-DOS / MS-DOS machines, including four reviewed
by Smith (1985). Unfortunately, APL by itself does not include
eigenanalysis, tail probabilities, quantile plots, etc., so that it is
advantageous to obtain such features additionally. One way to do this is
with the STATGRAPHICS package. The tables in this paper refer to the
combination of STATGRAPHICS and APL marketed by STSC, Inc.

Although STATGRAPHICS is a menu-driven statistical package, all of the
APL source is available as building blocks for the APL programmer. On
the other hand, it requires some effort to access the APL source code.
Platt and Platt (1985) describe the combination of APL and STATGRAPHICS
as being very powerful in its capabilities. Tables 1-12 verify that
there is a wide range of features, especially in graphics and
distributions. If one does not care for the APL character set, it is
possible to use instead a word-oriented syntax.

MS-DOS commands can be issued from within the system, and outside
programs can be run while preserving the current session.

There are two major criticisms of APL. First, note in Table 5 that there
is very little provision for structured programming. Of course, the
language was specified more than 25 years ago, when less emphasis was
given to structuring code in blocks for easy readability and easy
debugging. In APL a "do loop" must be built by setting and incrementing
a counter and using the APL equivalent of the "go to" command. There are
no "if" statements, so the conditional exit from the loop must involve a
computation such as a logical function. It should be emphasized that
there are many APL enthusiasts who do not consider the language
deficient in programming structures.

The second criticism of APL is in the area of
computational efficiency. As is indicated in
Table 12, there are programs which may take as
much as ten times as long in APL as in
other languages. For some purposes this may not
be a serious drawback, but for a Monte Carlo
study it could make a significant difference in
run time. A call to STSC did not produce any
encouragement about speed fixes on the horizon.
The suggestion from STSC is to find the
bottleneck and replace it with assembler or C
code, which can be linked with APL. This
philosophy has been implemented in STATGRAPHICS,
where the inversion operator has been rewritten
in C.

4.  GAUSS 

GAUSS is only a few years old. It runs only
on MS-DOS computers with the math coprocessor
chip. At $250 it is the least expensive of the
seven languages compared here. Also, its
hardware demands are relatively modest in that
it requires only 256K of memory and no hard
disk.

GAUSS has a simple full screen editor with no
copy or search facilities. Users may as an
alternative use their own editors. Programs can
be run interactively from the GAUSS editor. In
the interactive mode one has access to what has
just been run, and the commands can be edited
rather than retyped. A screen full of commands
from an interactive session can be saved,
edited, and run as a batch job.

There is a looseleaf manual (dated 1984) which
describes the package as of that time. There
are, however, a fair number of features which
have been added since then. They are documented
on the disks, but the quality of the
explanations leaves room for improvement. There
are also two disks of programs written in GAUSS,
and they too are documented only on the disks.

It would be nice to have written documentation
of good quality for the whole system. Online
help  would  also  add  to  the  ease  of  use.  As  it 
is,  the  looseleaf  documentation  is  barely 
adequate,  with  much  left  for  the  user  to  figure 
out,  and  the  disk  documentation  is  of  lower 
quality.  The  latest  news  from  the  authors  is 
that  new  written  and  online  documentation  is  on 
the  way. 

As in SAS MATRIX, all data are in the form of
matrices with two subscripts. Character data
are allowed and GAUSS permits mixing of
character and numerical data in arrays, but the
user must tell GAUSS which is which for
printing. There is a weakness in the area
of input files. GAUSS has a program to convert
ASCII files to its own format, but it is rather
limited in that the data items must be separated
by spaces in the ASCII file.

GAUSS interacts well with DOS. The program
shares  with  others  the  ability  to  interrupt  a 
session,  execute  DOS  commands  and  run  programs 
of  any  kind,  and  then  return.  A  newly  added 
feature  includes  links  to  the  common  FORTRAN 
compilers. 

GAUSS does not have "DO" loops, but it does
have a "DO...WHILE" structure which requires
setting and incrementing a counter. There is an
"IF...ELSEIF...ELSE...END" structure. The 1984
GAUSS had subroutines with nonlocal parameters
in the style of BASIC. Recently, compilable
subroutines with local parameters have been
added, but they are not very well documented.

GAUSS shares with MATLAB the use of a "."
before an operator to indicate an elementwise
operation. For example, "A.*B" yields
elementwise multiplication of A and B. A
sequence of consecutive integers is easily
specified in GAUSS, IML, MATRIX, and MATLAB
using the ":" notation. For example, "2:5"
specifies the integers from 2 to 5, inclusive.
In GAUSS this is allowed only in subscripts,
whereas the others allow assignment statements
of the form "I=2:5".
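
For readers who do not have one of these packages at hand, the two
notations just described can be imitated in plain Python. This is
only an illustrative sketch: the function names are hypothetical
and are not taken from GAUSS, MATLAB, IML, or MATRIX.

```python
# Pure-Python stand-ins for the matrix notation discussed above.
# All function names are hypothetical, chosen only for illustration.

def elementwise_product(a, b):
    """Analogue of "A.*B": multiply matching entries of two same-shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def matrix_product(a, b):
    """Ordinary row-by-column matrix product, for contrast with the above."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def int_range(first, last):
    """Analogue of "2:5": consecutive integers with both endpoints included."""
    return list(range(first, last + 1))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(elementwise_product(A, B))  # [[5, 12], [21, 32]]
print(matrix_product(A, B))       # [[19, 22], [43, 50]]
print(int_range(2, 5))            # [2, 3, 4, 5]
```

Note that Python's own range excludes its upper endpoint, which is
why the helper adds one to match the inclusive "2:5" convention.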

GAUSS has available a good set of tail
probabilities, including normal, t, chi-square,
and F. There are no inverse probability
functions on the main disk, but one of the two
other disks has a GAUSS program that obtains the
t inverse by the Newton-Raphson method.
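
The idea of obtaining the t inverse by Newton-Raphson can be
sketched as follows. This is not the GAUSS program mentioned above,
whose details are not given here; it is a minimal Python
illustration under stated assumptions: the tail probability is
computed by Simpson's rule (an assumption made for
self-containment), the iteration starts from zero, and all function
names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, panels=400):
    """Lower-tail probability: 0.5 plus Simpson's rule on the interval from 0 to x."""
    if x == 0:
        return 0.5
    h = x / panels  # negative when x < 0, so the integral changes sign as needed
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_inverse(p, df, tol=1e-10, max_iter=100):
    """Solve t_cdf(x, df) = p for x by Newton-Raphson, starting from x = 0."""
    x = 0.0
    for _ in range(max_iter):
        step = (t_cdf(x, df) - p) / t_pdf(x, df)  # f(x) / f'(x) for f = cdf - p
        x -= step
        if abs(step) < tol:
            break
    return x

print(round(t_inverse(0.975, 10), 3))  # upper 2.5% point for 10 d.f.: 2.228
```

Because the derivative of the distribution function is the density,
each Newton-Raphson step costs only one tail probability and one
density evaluation.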

Graphics  was  not  a  high  priority  in  the  1984 
edition  of  GAUSS.  Only  simple  non-hires  scatter 
plots  with  no  axes  are  described  in  the  manual. 
There are, however, some new graphics functions
on  the  two  supplementary  disks,  and  more 
functions  are  on  the  way,  using  the  color 
graphics  adapter. 

The advertising copy for GAUSS brags about
its speed, and benchmarks verify that the
program is fast. By the use of the Crout
algorithm for inversion and the use of assembler
to make sure that the math coprocessor
accumulates dot products internally, the authors
have achieved excellent times of 9.8 seconds for
a 50 x 50 multiply and 14 seconds for a 50 x 50
inverse, as shown in Table 12.

Examples  and  additional  benchmarks  are 
available in a paper by Platt and Platt (1985).

5.  SAS  IML  and  MATRIX 

SAS has been furnishing MATRIX with its base
product since the late 1970's, but SAS is now
phasing it out in favor of a separate product
called IML (Interactive Matrix Language).

Version 5 IML runs on 32-bit minicomputers and
IBM mainframes, and Version 6 IML runs on IBM
microcomputers. The emphasis here is on the
microcomputer version, although this version has
not been released yet. Note that the marketing
of IML for microcomputers is done on a site
lease basis which is designed to appeal to large
organizations. There is no provision for
individual purchase. The hardware requirements
are substantial: at least 512K of memory and
more than 5 megabytes of hard disk space. It
runs in PC-DOS and some versions of MS-DOS.

Those who have used the SPF editor in IBM's
TSO may be pleased that a similar editor is
included in the microcomputer implementation of
SAS. One can accumulate commands in the editor
and submit them whenever it is desired to
process commands. Online help is available.
MATRIX has a facility (PROC MATRIX PRINT) which
causes output from each computation to be
printed. IML improves on this by allowing the
feature to be toggled on and off. The other
languages have it, too, except for MATLAB. It
is very helpful in debugging.

Another advantage that IML has over MATRIX is
the ability to reference a one-dimensional array
by only one subscript. In MATRIX, the elements
of a vector are "V(1,I)" or "V(I,1)", whereas
they can be called "V[I]" in IML Version 6.

MATRIX has only BASIC-style subroutines with
global variables, but IML also has subroutines
with local variables. IML shares with others
the ability to interrupt a session and run DOS
commands or any programs, including FORTRAN,
etc.

One of the outstanding features of MATRIX
which is carried over into IML is the subscript
reduction operator, which is similar to the
reduction operator of APL and others, but
operates directly on the subscripts. There are
a variety of useful operators, including the
mean operator shown in Table 6. Moving the
operator from the first to the second subscript
switches from means of columns to means of rows.
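
The effect of moving a reduction operator from one subscript
position to the other can be imitated in a few lines of Python.
This is only a sketch: `reduce_subscript` is a hypothetical helper,
not IML or MATRIX syntax, and the position argument merely mimics
the choice of subscript.

```python
from statistics import mean

def reduce_subscript(x, position, op=mean):
    """Apply op over one subscript of a matrix stored as a list of rows.

    position 1 reduces over the row subscript (one result per column);
    position 2 reduces over the column subscript (one result per row).
    """
    if position == 1:
        return [op(col) for col in zip(*x)]
    if position == 2:
        return [op(row) for row in x]
    raise ValueError("position must be 1 or 2")

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
print(reduce_subscript(X, 1))        # column means: [2.5, 3.5, 4.5]
print(reduce_subscript(X, 2))        # row means: [2.0, 5.0]
print(reduce_subscript(X, 1, max))   # column maxima: [4.0, 5.0, 6.0]
```

Passing a different function for `op` corresponds to choosing a
different reduction operator, such as a sum or a maximum.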

Note  that  the  only  difference  between  MATRIX 
and  IML  in  Table  7  is  that  IML  uses  square 
brackets  for  subscripting.  The  subscript  syntax 
of Version 5 is a little messier than what is
shown for Version 6 (microcomputers) in Table 7.

IML  and  MATRIX  have  a  wide  range  of 
mathematical and statistical functions. Some of
the more esoteric operators such as Gram-Schmidt
orthogonalization and the fast Fourier transform
are  available  in  MATRIX  and  Version  5  of  IML  but 
not  yet  in  Version  6. 

For doing analysis of variance by regression
a facility is needed for obtaining design
matrices (creation of dummy variables), which
both SAS languages provide. Furthermore, the
horizontal direct product is available ("HDIR"
in IML, "@|" in MATRIX) for forming interaction
terms. Similarly, STATGRAPHICS has a "CROSS"
operator. Other languages can accomplish this
with the Kronecker product, but it is
comparatively awkward, especially in the
unbalanced case. Note that the SAS languages
also have an excellent orthogonal polynomial
facility which works well in the unbalanced
case.
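
To make the role of the horizontal direct product concrete, here is
a small Python sketch of dummy variable creation followed by the
formation of interaction columns. The helper names `design` and
`hdir` are borrowed loosely from the discussion above but are
hypothetical Python functions, not the SAS implementations, and the
two factors are made-up data for an unbalanced layout.

```python
def design(levels):
    """One 0/1 indicator column per distinct level, columns in sorted level order."""
    cats = sorted(set(levels))
    return [[1 if v == c else 0 for c in cats] for v in levels]

def hdir(a, b):
    """Horizontal direct product: row i holds every product a[i][j] * b[i][k]."""
    return [[x * y for x in row_a for y in row_b] for row_a, row_b in zip(a, b)]

# Hypothetical unbalanced two-factor layout with five cases.
factor_a = ["lo", "lo", "hi", "hi", "hi"]
factor_b = ["x", "y", "x", "y", "y"]

A = design(factor_a)   # columns: hi, lo
B = design(factor_b)   # columns: x, y
AB = hdir(A, B)        # columns: hi.x, hi.y, lo.x, lo.y

for row in AB:
    print(row)
```

Each row of the result is the Kronecker product of the matching
rows of the two dummy variable matrices, so the operation works
case by case and does not depend on the design being balanced.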

SAS  MATRIX  has  quite  a  complete  set  of  tail 
probabilities, inverse probabilities, and random
number  generators.  The  inverse  probabilities 
include  only  the  normal  and  beta,  but  the  t,  chi- 
square,  and  F  can  be  obtained  from  the  beta.  On 
the  other  hand,  the  only  inverse  available  in 
IML  is  the  normal,  and  random  numbers  are 
available  only  from  the  uniform  and  normal 
distributions. 

MATRIX  has  to  rely  on  communication  with  the 
rest  of  SAS  to  do  graphics,  but  IML  Version  5 
has  a  substantial  graphics  component.  This 
includes  facilities  for  spline  fits  and 
character  labels  (of  arbitrary  length)  for  the 
points  on  a  plot.  There  is  as  yet  no  such 
facility  in  the  microcomputer  version. 

6. MATLAB

MATLAB has been available for about five
years as a FORTRAN program on large computers.
This review is based on a demo disk for
PC-MATLAB, which is an enhanced version written in
C for MS-DOS microcomputers. This will soon be
available also for the Macintosh and VAX VMS.

The  emphasis  is  more  on  numerical  analysis  and 
engineering  than  on  statistics.  Not  all 
documented  functions  are  present  on  the  demo 
disk.  Notice  in  Table  1  that  MATLAB  is  the  only 
one  of  the  microcomputer  languages  reviewed  that 
is  copy  protected. 

The program has an interactive mode in which
commands are executed as they are given. There
is also a batch mode, but the command file must
be formed elsewhere because MATLAB has no batch
editor. There is provision for a diary file in
which interactive commands can be accumulated,
but the file also contains the output, which
would need to be edited out to get a batch
command file. One does not need to retype an
erroneous command in interactive mode because a
line editor is available to edit the last
command.

MATLAB  has  a  useful  online  help  facility. 
Typing  "HELP"  causes  all  of  the  commands  to  be 
listed on one screen. Choosing a command from
this  list,  one  can  then  get  more  detailed  help. 

There is not much ability to label output.
Character  data  are  allowed  but  they  cannot  be 
used  to  label  the  rows  of  a  numerical  matrix. 

With others, MATLAB shares the ability to
suspend a session and run a DOS command or any
program. The MATLAB documentation suggests that
a FORTRAN program can be used with MATLAB by
first storing data, running the FORTRAN program
on the data, and then reading the FORTRAN
results back into MATLAB.

MATLAB  allows  subroutines  with  local 
variables.  These  subroutines  can  be 
precompiled. 

Given that the program originated with the
well-known numerical analyst Cleve Moler, it is
not surprising that MATLAB contains a full
complement of procedures for numerical analysis,
as shown in Table 8. On the other hand, Table
10 shows that there are no tail probabilities
available, and none of the corresponding
inverses, although normal and uniform data can
be generated.

For graphics, MATLAB supports the Hercules
card, IBM color graphics, and IBM enhanced
graphics, as does STATGRAPHICS. It is possible
to overlay plots, and there is provision for
showing strata such as male and female on one
plot. Normal plots and exploratory data
analysis commands are not included.

The  benchmarks  in  Table  12  are  quite  good. 


7. S

S was written to work within the UNIX system
at Bell Labs, and until a recent implementation
on VAX VMS, it ran only in the UNIX
environment. Although there are AT&T
microcomputers that support UNIX, the smallest
machine for which AT&T recommends S is the 3B2.
Note that S costs $8000, except that it costs
only $400 for universities. This review is
based mainly on the manual by Becker and
Chambers (1984).

Both  interactive  and  batch  modes  are 
supported  by  S.  Batch  files  can  be  edited  by  a 
UNIX editor.

The  manual  shows  excellent  facilities  for 
character  data  and  the  labeling  of  output. 
Multidimensional arrays are allowed, as is true
of  APL  but  not  the  others,  although  facilities 
for  character  manipulation  can  be  used  in  GAUSS 
and  SPEAKEASY  to  index  2-dimensional  arrays. 

The program is integrated into UNIX: it
allows UNIX commands and UNIX programs in C,
FORTRAN, and PASCAL to be executed from within
S. Extensions to S can be accomplished with
macros.

There  is  a  sweep  operator  in  S,  but  it  is  for 
centering  data,  and  not  for  doing  Beaton  (1964) 
sweeps. 

The available commands include substantial
facilities for regression analysis, even the
Furnival-Wilson (1974) leaps and bounds
all-subsets regression algorithm. On the other
hand, there is not much available for analyzing
designed experiments. In particular, there are
no commands for forming dummy variables.

The facilities for probability distributions,
inverse probability distributions, random number
generators, and quantile plots are among the
best available, and they are named consistently
so as to be easily remembered. S also has one
of the best plotting facilities available.

Points  can  be  labeled  effectively,  e.g.,  with 
two-letter  state  abbreviations  if  the  points 
represent  states.  The  values  of  a  third 
variable  can  be  indicated  by  using  the  values  of 
the  third  variable  to  determine  the  size  of 
circles  centered  at  the  points. 

The  S  program  has  impressed  many 
statisticians  who  have  had  a  chance  to  use  it. 
For  a  very  favorable  review  of  the  manual  and 
the  program,  see  Larntz  (1986). 


8.  SPEAKEASY 

SPEAKEASY has been available on various IBM
computers, and more recently the DEC VAX, since
the 1960's. Written in FORTRAN, it now also
runs on MS-DOS microcomputers. Although the
microcomputer version does not have all of the
mainframe features, it does have over 600
functions available. It uses Intel FORTRAN and
requires 640K, just as BMDP has done. SPEAKEASY
uses the full memory, with nothing else allowed
in the way of coresident programs. It also
takes about 5 megabytes of hard disk space.

The size of SPEAKEASY makes it harder to
find the right command than it would be in a
smaller system such as MATLAB, where the "HELP"
command puts the complete command list on one
screen. The "HELP" command in SPEAKEASY puts a
list of command categories on the screen, and
these can be examined with more detailed help
commands. In general, there is room for
improvement in the documentation. There is very
little documentation specific to the
microcomputer version, and it is not always
clear how to use the commands.

Commands can be executed interactively or in
batch mode, and a line editor is available to
edit batch commands. The "JOURNAL" command is
available to toggle on and off the automatic
output of commands and/or results. By using it
to record commands in interactive mode, one can
form a file that can be edited and run in batch
mode.

SPEAKEASY  does  a  fine  job  with  character  data 
and the "TABULATE" command gives nicely labeled
printouts  of  arrays.  Ling  (1985)  has  given 
examples  of  this  feature. 

Another  useful  feature  stressed  by  Ling  is 
the  ability  to  interrupt  a  session  and  execute 
DOS commands and programs such as FORTRAN. A
FORTRAN  link  is  available,  but  only  to  Intel 
FORTRAN.  A  recently  added  feature  in  SPEAKEASY 
is  the  inclusion  of  subroutines  with  local 
parameters. 

There are two kinds of two-dimensional
structures, arrays and matrices. If A and B are
arrays of the same size, then "A*B" yields the
elementwise product, of that size. If A and B
are matrices of appropriate size, then "A*B" is
the matrix product. Thus commands are necessary
to convert from matrix to array ("AFAM") and
array to matrix ("MFAM"), to assure that the
desired operation is performed.


The last two columns of Table 7 may require
some explanation. If L and R each have N rows
and L has M columns, then they can be joined by
setting "L(1,M+1)=R". The procedure for
vertical concatenation of arrays T and B with
the same number of columns is to set "T(N+1,1)=B"
if T has N rows. In general, "U(I,J)=V" allows
you to insert V anywhere in U, regardless of
their dimensions, and SPEAKEASY will adjust the
dimensions of the result.

SPEAKEASY has a substantial set of graphics
commands using both text characters and color
graphics. Plots can be displayed interactively,
with a new graph overlaid on one that is
already on the screen. Labels can similarly be
added. This is the only way, however, to
display separate strata with separate symbols.
Thus, the procedure would be awkward if there
are several strata, because each stratum
requires a separate plot command, although the
commands could be placed in a loop.

Being written in FORTRAN, SPEAKEASY is not as
fast as MATLAB and GAUSS, as shown in Table 12.
Unfortunately, it does not seem possible with
current FORTRANs to keep up with optimized
assembler and C code for the math coprocessor.


9.  Conclusions  and  Recommendations 

The headings in Tables 1-12 indicate some of
the features which are, in the author's opinion,
important in a statistical matrix language. It
is hoped that authors of such languages and
reviewers of them will take into account the
listed features.

One conclusion that can be drawn from the
tables is the surprising degree of similarity
among the various matrix languages. Especially
in Tables 6 and 7 there are strong parallels
among them. Their capabilities and notations
are very similar for matrix operations, indices,
subsetting, and concatenation.

There are, of course, significant differences
among the languages. To some extent these gaps
tend to narrow, perhaps because the authors are
aware of what others are doing, perhaps because
their users are aware of what others are doing
and demand similar facilities. One might wonder
if it is coincidental that SPEAKEASY, GAUSS, and
MATLAB have all recently added subroutines with
local variables, and IML also has this feature,
although its predecessor MATRIX does not.

In the area of plots and probability
distributions, it would be nice to see other
languages copy from the example set by IML
Version 5, S, and STATGRAPHICS. Especially for
plots, which cannot be readily programmed by the
user, it would be good to see some of the other
languages add some features. Statistical
detective work is enhanced by well-labeled
plots, and every language should be able on
scatter plots to distinguish different strata
and to label each point with the values of an ID
variable.

There are some features that are available in
few if any of these languages. Many of them are
not very hard to implement in a matrix language,
but some are quite difficult. In the not so
hard class are multivariate normal variate
generation (available in STATGRAPHICS) and
Wishart matrix generation, as described by
Kennedy and Gentle (1980). More difficult are
L1 estimation and all-subsets regression. The S
package includes the Furnival-Wilson (1974)
algorithm, which is remarkably efficient and
remarkably intricate. Many such difficult
algorithms are obtainable free or at low cost in
FORTRAN, so a language with FORTRAN linkages
will allow the user great latitude in
algorithms, beyond what is an integral part of
the language.
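
As an indication of why multivariate normal generation falls in the
not so hard class, the following Python sketch uses one standard
approach: factor the covariance matrix as L L' by the Cholesky
method and transform independent standard normals. It is not
claimed to be the method of Kennedy and Gentle (1980) or of
STATGRAPHICS; the function names, the covariance matrix, and the
seed are all assumptions made for illustration.

```python
import math
import random

def cholesky(s):
    """Lower-triangular L with L L' = s, for a symmetric positive definite s."""
    n = len(s)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def mvn_draw(mu, L, rng):
    """One draw mu + L z, where z holds independent standard normal variates."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]

sigma = [[4.0, 2.0],
         [2.0, 3.0]]          # hypothetical covariance matrix
L = cholesky(sigma)
rng = random.Random(1986)     # arbitrary seed, for reproducibility only
sample = [mvn_draw([0.0, 0.0], L, rng) for _ in range(5)]
for draw in sample:
    print(draw)
```

In a matrix language the same computation is a one-line expression
once a Cholesky factor and a normal generator are available, which
is why the feature is easy to add.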

It could be argued that time series operators
should be included among the hard-to-program
features that these languages provide, but
there are specialized languages
such as RATS that include many time series
functions along with a matrix language. For
serious time series analysis, such specialized
languages are better suited than any of the
microcomputer languages considered here, except
that STATGRAPHICS has strong time series
features and is a contender in this area.

Sometimes textbook algorithms are wasteful of
computer space and time. One example is the
formation of a large square matrix from which
only the diagonal is needed. The IML Version 6
manual, p. 116, follows such an algorithm in
forming an n x n projection (hat) matrix, which
will be very large if the number of cases n is
large, and then using only the diagonal (for
leverages and confidence intervals). It might
be asking too much to expect that an interpreter
would recognize that only the diagonal is
needed, and produce only the needed part,
although this is a simpler form of optimization
than is used in some FORTRAN compilers. At a
minimum, the manual should show how to compute
what is needed. That is, it should be explained
that a more efficient equivalent of

    VECDIAG(X*INV(X`*X)*X`)

is

    (X*(INV(X`*X))#X)[,+] .

The second method need not be recommended in all
instances, but only when n is large.
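
The equivalence of the two expressions can be checked numerically.
The Python sketch below is an illustration only, with hypothetical
helper names and a made-up design matrix; it computes the leverages
both ways, once through the full n x n hat matrix and once through
row sums of an n x p elementwise product.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting, for a small nonsingular matrix."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def leverages_full(x):
    """Textbook route: form the n x n hat matrix, then keep only its diagonal."""
    xt = transpose(x)
    h = matmul(matmul(x, inverse(matmul(xt, x))), xt)
    return [h[i][i] for i in range(len(x))]

def leverages_rowwise(x):
    """Economical route: row sums of (X (X'X)^-1) times X, taken elementwise."""
    xa = matmul(x, inverse(matmul(transpose(x), x)))
    return [sum(u * v for u, v in zip(row_a, row_x)) for row_a, row_x in zip(xa, x)]

X = [[1, 1], [1, 2], [1, 3], [1, 4]]   # intercept plus one regressor
print([round(v, 3) for v in leverages_full(X)])     # [0.7, 0.3, 0.3, 0.7]
print([round(v, 3) for v in leverages_rowwise(X)])  # [0.7, 0.3, 0.3, 0.7]
```

The second function never stores more than n x p numbers at a time,
whereas the first stores n x n, which is the whole point of the
recommendation when n is large.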

There are two disparate goals for these
matrix languages. On the one hand they should
be interactive and easily debugged, and on the
other hand they should be efficient. The first
goal is more easily met by an interpreted
language, and the second goal is more easily met
by a compiled language. An ideal compromise
would involve a dual-mode language which would
allow debugging in interpreted mode and a switch
to compiled mode for efficient operation. GAUSS
and MATLAB already make this available, to some
extent, by allowing some compilation.

It can be argued that the availability of
powerful matrix languages on personal computers
should change the way that statistics is done.
The GAUSS documentation suggests that one should
now be much more inclined to do maximum
likelihood estimation and other computations
that do not necessarily give closed-form
answers. As stated in the GAUSS manual, page
9-2, "Remember, while you're having dinner, your
PC and GAUSS can do computations that would cost
hundreds of dollars on a mainframe!" The ideal
situation is that the statistician knows exactly
how to do the appropriate computations, given
the powerful hardware and software on the
desktop, but there will need to be a lot of
changes to manuals and textbooks before this
knowledge is widespread. Much of the software
is here now, and the authors of these languages
deserve thanks for making available excellent
tools for doing the computationally intensive
statistics of the future.


Table 1. MATRIX LANGUAGES


             OPERATING                              COPY        NEED HARD
             SYSTEM              PRICE ($)          PROTECTED?  DISK?      MEMORY

APL/STATG    MS-DOS, etc.        595 + 695          NO          NO         512K
                                 (295 + 350 ACAD)
GAUSS        MS-DOS              250                NO          NO         256K
SAS IML      PC-DOS, VMS,        (SITE LEASE)       NO          YES        512K
             TSO, CMS
SAS MATRIX   OS, CMS, VMS, etc.
MATLAB       MS-DOS              695 (395 ACAD)     YES         NO         256K
S            UNIX, VMS           8000 (400 ACAD)
SPEAKEASY    MS-DOS, etc.,       LEASE              NO          YES        640K
             VMS, CMS, TSO


Table 2. MODES OF OPERATION (ALL INTERPRETED)

                                            COM. DIARY
             BATCH  EDITOR    INTERACTIVE   FILE        HELP  PRINT

APL/STATG    YES    YES       YES           YES         YES   YES
GAUSS        YES    YES       YES           YES         NO    YES
IML          YES    SPF-TYPE  YES           YES         YES   YES
MATRIX       YES    OPER SYS  OPER SYS      OPER SYS    NO    YES (NO TOGGLE)
MATLAB       YES    OPER SYS  YES           YES         YES   NO
S            YES    (UNIX)    YES           YES         YES   YES
SPEAKEZ      YES    LINE      YES           YES         YES   YES


Table 3. DATA STRUCTURES

             CHAR    3-D       MISSING
             DATA    ARRAYS    DATA      VECTOR

APL/STATG    YES     YES       YES       YES
GAUSS        YES     SORT OF   YES       NO
IML          YES     NO        YES       YES
MATRIX       NAMES   NO        NO        NO
MATLAB       YES     NO        YES       YES
S            YES     YES       YES       YES
SPEAKEZ      YES     SORT OF   YES       YES


Table 4. ENVIRONMENT

             DOS                 ASCII           FORTRAN
             COMMANDS            READ, WRITE     LINK

APL/STATG    YES, AND PROGRAMS   YES             C, ASSEMBLER
GAUSS        YES, AND PROGRAMS   SPACES NEEDED   (NEW)
IML          YES, AND PROGRAMS   YES             SHARED DATA
MATRIX       -                   -               AWKWARD
MATLAB       YES, AND PROGRAMS   SPACES NEEDED   SHARED DATA
S            UNIX                YES             YES (UNIX)
SPEAKEZ      YES, AND PROGRAMS   YES             INTEL


Table 5. PROGRAMMING STRUCTURES

                            IF, ELSEIF,
             DO LOOPS       ELSE, END       SUBROUTINES

APL/STATG    NO             NO              LOCAL
GAUSS        DO WHILE       YES             (NEW) LOCAL, COMPILE
IML          DO END         IF THEN ELSE    LOCAL, MACRO
MATRIX       DO END         IF THEN ELSE    LINK, RETURN, MACRO
MATLAB       FOR END        YES             LOCAL, COMPILE
S            FOR( ) { }     IF ELSE         MACRO
SPEAKEZ      FOR NEXT       IF( ) ( )       LOCAL


Table 6. OPERATIONS

             ELEMENT      MATRIX
             PRODUCT      PRODUCT      INVERSE   COL MEANS

APL          ×            +.×          ⌹A        (+⌿X)÷1↑⍴X
GAUSS        .*           *            INV       MEANC(X)
IML          #            *            INV       X[:,]
MATRIX       #            *            INV       X(:,)
MATLAB       .*           *            INV       [m,n]=size(X); sum(X)/m
S            *            %*           SOLVE     col(X,MEAN)
SPEAKEZ      * (array)    * (matrix)   INVERSE   SUMCOLS(X)/NOROWS(X)


Table 7. INDICES, SUBSETTING, CONCATENATION

             INDEX                    HORIZ         VERT
             VECTOR    SUBMATRIX      CONCAT        CONCAT

APL          ⍳N        A[⍳3;]         L,R           T,[1]B
GAUSS        1:N       A[1:3,.]       L~R           T|B
IML          1:N       A[1:3,]        L||R          T//B
MATRIX       1:N       A(1:3,)        L||R          T//B
MATLAB       1:N       A(1:3,:)       [L R]         [T;B]
S            1:N       A[1:3,]        CBIND(L,R)



PUBLISHING STATISTICAL SOFTWARE


John  C.  Nash,  University  of  Ottawa 


Abstract 


Statistical  computation  is  at  the 
heart  of  a  large  part  of  all  statistical 
research  and  analysis.  The  growing 
complexity  and  diversity  of  software  for 
statistical  computations  implies  that 
statisticians  spend  a  growing  proportion 
of  their  professional  lives  developing, 
learning  and  using  such  software. 

This  paper  will  review  the  mechanisms 
by which statistical software is
published,  that  is,  made  available  to 
statistical  practitioners.  In 
particular,  emphasis  will  be  placed  on 
the  issue  of  academic  or  commercial 
credit  for  the  research  and  development 
work  which  good  software  demands. 
Potential  approaches  to  inclusion  of 
software  in  professional  performance 
evaluation  are  discussed. 

Traditional  software  publishing 

Traditional approaches to publishing
statistical software mirror the methods
used to publish scientific ideas in
general. That is, programs of a more
academic nature have been disseminated
in journals and related sources, while
didactic or less weighty codes have
appeared in trade or special interest
magazines. Books, too, have had their
role, either in discussing algorithms in
a step-and-description fashion which
readers can then program in a manner
suited to their needs and interests, or
in presenting listings (generally in
FORTRAN) of the author's code.

Examples of journals which have
published statistical software are the
ACM Transactions on Mathematical
Software, with the companion publication
of the Collected Algorithms of the ACM
(CALGO), which publishes the complete
listings, and Applied Statistics, which
has included numbered programs since the
late 1960's. Magazines such as Byte and
Interface Age have included codes from
time to time for statistical
applications. Sadly, many of these have
been of inferior quality, which led to
my own involvement in writing articles
to attempt to expose the difficulties of
preparing scientific software (e.g.
Nash, 1981).

Books  containing  algorithms  —  for 
example,  Kennedy  and  Gentle  (1980)  or 
Nash  (1979)  —  are  rarer  than  those 
presenting  listings.  In  the  latter 
category  are  Ratkowsky's  (1983) 
presentation  of  nonlinear  least  squares 
FORTRAN  programs  and  Maindonald's  (1984) 
book  on  linear  statistical  computations 
with  programs  and  examples  mainly  in 
BASIC. 


Most  statisticians  are  familiar  with 
scientific  software  libraries,  from 
which  subroutines  (or  complete  programs) 
may  be  called  to  carry  out  calculations. 
Well-known  general  purpose  libraries  are 
those of IMSL and NAG, while BMDP
focuses on biostatistical routines.
Several libraries suitable for
microcomputer use have been advertised
in the recent past. However, I am aware
of only one, C. Abaci's Scientific
Desk, whose authors have a
serious interest in and understanding of
numerical reliability. This does not
mean  there  are  no  other  quality 
scientific  software  libraries,  but  that 
quality  is  difficult  to  ascertain. 

(Note  that  IMSL  and  NAG  have  released 
subsets  of  their  mainframe  libraries.) 

Such  libraries  can  be  considered  as  a 
form  of  software  publishing.  This  role 
becomes  clearer  when  the  activities  of 
traditional  publishing  houses  are  noted: 
Wiley's  efforts  with  the  Peerless 
Engineering  Service  Scientific 
Subroutine  Library  and  Wadsworth's 
investment  in  Statpro.  Ignoring  the 
important  question  of  commercial 
viability  of  such  ventures,  it  is 
necessary  to  decide  whether  the  software 
is  published  or  simply  "made  available", 
since  source  codes  may  not  be  released. 
For  the  present,  we  will  take  libraries 
as  a  form  of  publishing  of  software. 

Similarly,  packages  may  be  thought  of 
as  a  form  of  software  publishing,  though 
the  creative  core  of  the 
programmer/statistician's  art  is  now 
almost  certainly  hidden  in  object  code, 
and  often  behind  the  curtain  of  a 
(possibly inconvenient) user interface.

Finally,  authors  of  technical  reports 
and  journal  articles  may  offer  to  make 
their  programs  available  privately.  The 
notice  of  the  availability  of  the 
programs constitutes their
"publication".

Paper-based  dissemination  of  software 
has  the  distinct  advantage  that  humans 
can  read  and  appreciate  it.  However, 
while it is relatively inexpensive to
obtain  listings,  implementation  and 
testing  may  be  costly,  if  not  in  money, 
then  certainly  in  effort  and  time. 

There  is  also  the  question  of 
publication  delay,  which  implies  that 
the  user  gets  a  program  which  is  likely 
to  date  from  two  years  prior  to 
publication,  even  if  hot  off  the  press. 
Computer  magazines,  traditionally 
up-to-date  with  news  and  information, 
now  have  publication  delays  approaching 
a year.

Updates to books and journal
articles,  except  for  errata,  are  likely 
to  be  similarly  delayed.  Worse,  modest 
changes  which  dramatically  improve  the 
efficiency  of  the  program  even  though 
they  do  not  change  the  scientific 
content  of  ideas  therein  may  not  be 
considered  "original"  by  journal 
editors.  Updates  and  improvements  are 
therefore  unlikely  to  be  easily 
obtained.

Private  distribution  by  the  author  or 
his/her  organization  has  been  a  popular 
mechanism  for  scientific  software 
distribution.  It  works  well  as  long  as 
the  chore  does  not  become  too  tedious  or 
expensive.  Since  people  move,  change 
jobs,  take  sabbatical  leave,  etc.,  the 
distribution  link  usually  dies  after  a 
few  years.  With  no  external  monitoring 
of  the  design  and  development  of  such 
software,  quality  may  vary  greatly. 

Some  paper  publications  also  offer 
distribution  of  machine  readable  media. 
Originally  this  meant  cards  or  magnetic 
tapes.  More  recently,  diskettes  have 
become  popular,  with  some  books  even 
including  a  pocket  in  one  cover  for  the 
diskette  (e.g.  Nash  and  Walker-Smith, 
1986).  Another  approach  involves 
printing  bar-code  style  information  in 
strips  right  on  the  paper  which  can  be 
read  with  a  special  reader  (the  Cauzin 
Systems Softstrip mechanism).

Alternative  publishing  mechanisms 

Other mechanisms exist by which
software may be published.

"Shareware" or "freeware", also
called  "user  supported  software",  is 
distributed  by  users  giving  copies  to 
colleagues  and  associates.  The  software 
creator  generally  claims  copyright  on 
the  product,  but  grants  permission  for 
the  copying.  A  fee  is  demanded  for  the 
manual  and  for  updates  and  corrections 
of  the  software  or  for  technical 
support,  which  are  usually  supplied  by 
the  author.  In  some  cases,  for  example 
the  word  processing  program  PC-WRITE, 
this  concept  has  proven  extremely 
successful.  EPISTAT  is  a  suite  of 
statistical  programs  distributed  as 
shareware.

Various  computer  bulletin  boards  and 
some  commercial  information  vendors 
provide  for  the  down-loading  of  public 
domain  software.  Mostly,  this  consists 
of  games  or  utility  programs,  though 
some  technical  software  may  be  found  on 
the  commercial  systems,  for  example,  the 
Byte  Information  Exchange  (BIX).  User 
charges  are  generally  based  on  connect 
time  plus  telecommunications  charges. 

A further alternative is electronic
mail. Dongarra and Grosse (1985, 1986)
offer an impressive array of
mathematical software, available to
anyone with access directly or
indirectly to the ARPAnet. This
includes most of the academic electronic
mail networks, such as BITNET (and
NETNORTH), MAILNET, CS-NET, USENET,
PHONENET, and others. To obtain a
listing of what is available, a user
sends the message

SEND  INDEX 

to the pseudo user NETLIB
at the ANL-MCS node of the ARPAnet
(NETLIB@ANL-MCS.ARPA). A similar
message, for example,

SEND  CSVDC  FROM  LINPACK 

will  initiate  the  reply  which  consists 
of  that  FORTRAN  subroutine  from  the 
LINPACK  collection.  Currently  available 
are 

EISPACK -- matrix eigenvalue problems
LINPACK -- linear algebraic calculations
CALGO   -- Collected Algorithms of the ACM
FUNPACK -- special functions

and a number of other collections.
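
As a rough present-day illustration of the request format just described, the following Python sketch composes (but does not send) such a message. The address is the historical one quoted above and should not be expected to work today. The sender address, the Subject line, the function name `netlib_request`, and the idea of placing several requests in one message are assumptions made purely for illustration.

```python
from email.message import EmailMessage

# Historical address quoted in the text; kept only for illustration.
NETLIB_ADDRESS = "netlib@anl-mcs.arpa"


def netlib_request(sender, *requests):
    """Build a mail message whose body holds one request per line."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = NETLIB_ADDRESS
    msg["Subject"] = "netlib request"  # assumed; the text gives no subject
    msg.set_content("\n".join(requests) + "\n")
    return msg


# Hypothetical sender; the two request lines are those shown in the text.
msg = netlib_request("user@example.edu", "SEND INDEX",
                     "SEND CSVDC FROM LINPACK")
print(msg.get_content())
```

Handing the message to a mail transport is deliberately left out, since that step depends entirely on the local mail system.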

Comparison of alternative and traditional software publishing mechanisms

To  the  best  of  my  knowledge,  a 
comprehensive  cost-benefit  comparison  of 
the  different  modes  of  software 
publishing  has  not  been  carried  out. 
Statistical  software,  which  is  a 
relatively  small  market  segment  of  the 
overall  software  marketplace,  cannot  be 
compared  directly  with  such  products  as 
word-processing  packages  or 
communications  packages. 

Nevertheless,  it  is  clear  that 
shareware, online and electronic mail software
delivery  modes  all  offer  easy  updating 
of  programs,  providing  the  user  is 
willing  to  pay  the  generally  small 
charges  for  contacting  the  human  or 
machine  sources.  It  is  in  this 
placement  of  the  responsibility  for 
obtaining  updates  on  the  user  that 
fairly  substantial  cost  savings  to  the 
vendor  /  supplier  arise.  Instead  of  the 
software  vendor  mailing  updates  to  many 
users,  only  those  who  request  them  are 
serviced.  This  may  lead  to  many  users 
working  with  obsolete  or  defective 
versions  of  the  programs,  but  this  is 
hardly  different  from  the  situation 
where  a  user  has  not  seen  or  bothered  to 
implement  a  published  correction. 

Software and documentation are generally
delivered  in  machine  readable  form  by 
the  alternative  publishing  mechanisms. 
Some  users  may  prefer  nicely  printed 
manuals  including  tutorials,  reference 
material,  and  installation  guides.  For 
electronically  transmitted  files,  data 
compression  may  be  advisable  to  cut 
telecommunication  costs. 

Version/edition control may be a
problem, especially with the temptation
to add updates as soon as they are
available.

The  mode  of  use  of  software 
distributed  by  online  or  electronic  mail 
mechanisms  will  be  considerably 
different  from  paper-based  distribution. 
Compared  to  conventional 
libraries/packages,  the  source  of 
advice/support  will  likely  become 
the  database  supplier  in  place  of  the 
computing  center  of  a  university  / 
company  /  institute.  The  user  may  have 
a  much  bigger  role  to  play  in  the 
evolution  of  software,  for  example,  in 
the  types  of  software  supplied,  the 
correction  of  faults,  the  enhancement  of 
features,  the  development  of 
applications  and  the  preparation  of 
documentation. 

Economic and academic issues

All  of  the  alternative  distribution 
mechanisms  have  the  advantage  of  initial 
low-cost  of  delivery  to  the  user.  They 
rely  on  word-of-mouth  or  traditional 
advertisements,  however,  to  attract  user 
attention,  and  may  therefore  not  achieve 
a  desired  level  of  awareness.  Shareware 
may impose costs of production for
manuals, updates, etc. on the vendor of
a very similar level to those
experienced by traditional software
publishers.

It  is  more  difficult  to  estimate 
costs  for  online  and  electronic  mail 
modes  of  distribution.  For  sake  of 
discussion,  a  figure  of  the  order  of  25 
cents/1000  characters  is  probably 
reasonable  at  the  present  time  for  the 
communications costs. (Electronic mail is
probably  cheaper  than  online 
distribution,  but  many  of  the  costs  are 
buried  within  the  overall  network 
management  costs,  frequently  borne  by 
universities  as  a  service  to  their 
members.)  Both  of  the  electronic  methods 
of  distribution  gain  by  the  lack  of 
human  intervention  in  the  distribution 
process.  Furthermore,  the  user  may 
choose  to  download  only  a  small  segment 
of software or documentation of
particular interest.
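
To make the order-of-magnitude figure above concrete, here is a small Python sketch of the arithmetic. The rate of 25 cents per 1000 characters is the discussion figure from the text; the function name, the optional compression ratio, and the 40,000-character listing are hypothetical values chosen for illustration.

```python
def transmission_cost_dollars(n_chars, cents_per_1000=25.0,
                              compression_ratio=1.0):
    """Estimate the communications charge for sending n_chars characters.

    compression_ratio is compressed size / original size (1.0 = none).
    """
    if not 0.0 < compression_ratio <= 1.0:
        raise ValueError("compression_ratio must be in (0, 1]")
    sent = n_chars * compression_ratio
    return sent / 1000.0 * cents_per_1000 / 100.0


# A hypothetical 40,000-character listing, sent as-is and at half size.
print(transmission_cost_dollars(40_000))
print(transmission_cost_dollars(40_000, compression_ratio=0.5))
```

Under these assumptions the charge scales linearly with the number of characters sent, which is why downloading only the segment of interest matters.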

To  date,  none  of  the  software  being 
distributed  electronically  is  returning 
any  revenue  to  its  creators.  Even 
commercial  systems  are  charging  mainly 
for  telecommunications  and  database 
provision  services,  along  with  profit 
for  the  vendor.  If  software  authors  are 
to  be  expected  to  create  programs  for 
distribution,  they  must  be  rewarded,  and 
pricing  mechanisms  which  balance  between 
author  greed  and  user  theft  (i.e. 
unauthorized  copying)  are  needed.  The 
major  unanswered  question  is  whether  a 
price  exists  which  is  high  enough  to 
cover  the  hardware  and  communications 
costs  and  a  royalty  to  the  author,  but 
low enough so that it is not worthwhile
to the user to have any but the latest
authorized  version. 

The  obstacles  to  the  development  of 
alternative  methods  for  publishing 
statistical  and  other  scientific 
software  are  primarily  those  relating  to 
credit  for  the  creation  of  intellectual 
property.  In  particular,  the  general 
academic  fixation  with  paper-based 
journals implies that authors who
support  alternative  publication  vehicles 
may  find  that  their  work  "doesn’t  count" 
for  academic  rewards  in  tenure  or 
promotion.  A  related  issue  is  that  of 
crediting  workers  who  actually  support 
software  in  general  use  —  software  upon 
which much research in practically all
disciplines may depend. Some years ago,
I derived formulas to extend the Gini
Ratio (a statistic used to assess the
uniformity of income distributions) to
the situation where incomes may be
negative.  Furthermore,  I  documented  the 
interpretation  of  the  statistic  in  these 
cases,  wrote  a  program  to  analyze  the 
data,  and  ran  a  considerable  portion  of 
the  calculations.  However,  when  the 
report  of  this  work  was  published,  I  was 
neither  listed  as  author,  nor  mentioned 
as  someone  who  had  provided  assistance. 
The  point  here  is  not  that  the  authors 
were  ill-intentioned,  but  that  the  role 
of  "supporting  cast"  is  often  accorded  a 
very  limited  status.  In  the  case 
described,  pointing  out  to  the  authors 
the  extent  to  which  the  results  of  their 
work relied on my software resulted in
satisfactory  recognition. 

A  more  general  difficulty  concerns 
the  possibility  that  a  program  may  be 
altered  over  time  by  contributions  from 
a  number  of  workers.  Who  should  then 
get  the  credit?  This  is  a  continuing 
issue. It is compounded by the reality
that  a  researcher  gets  more  academic 
value  from  a  completely  different 
program,  even  if  it  does  not  work 
particularly  well,  than  from  a  minor 
change  to  an  existing  program  which 
doubles  its  performance. 

What can be done to ensure good
software is published?

This  paper  will  not  attempt  to  define 
"good  software".  However,  whatever 
metric  is  used  to  judge  software  as 
good,  I  would  suggest  that  the 
publication  of  such  software  should  be 
such  that: 

1) it is widely available and easily
obtained and installed;

2) its price is reasonable,
that is, comparable to the price of
a statistics monograph.

It  is  possible  that  if  such 
conditions  apply,  the  only  software 
available would be good by virtue of
market competition. For it to be
worthwhile for programmers and
statisticians  to  create  this  software, 
the  profession  must  recognize,  as  a 
body, 

-  that  communications  need  not  be  on 
paper; 

-  that software is an important part
of  our  work. 

Notable  movements  in  these  directions 
are  perceptible: 

1)  the  Natural  Sciences  and 
Engineering  Research  Council  of 
Canada  asks  researchers  in  the 
Computer  Science  category  to  list 
software  packages  produced. 

2)  statistics  journals  and 
newsletters,  in  particular  American 
Statistician, have been including an
increasing number of software
announcements  and  reviews. 

3)  the  Dongarra  /  Grosse  NETLIB 
project  has  established  the  technical 
possibility  of  electronic  mail  for 
software  distribution. 

While  the  development  of  electronic 
distribution  depends  on  the  growth  of 
the required network infrastructure,
there  are  a  number  of  individual 
initiatives  possible.  Statisticians  can 
strive  to  ensure  that  software  support 
and assistance is properly credited.
Personal software activity can be listed
in annual reports. At the risk of
giving offense to users, one can demand
acknowledgement or even co-authorship.
More subtly, the mindset of the
profession can be influenced by asking
questions about software generation and
support in employment interviews or
questionnaires. Use of existing
electronic mail facilities for
transmission of software, manuscript or
bibliographic material is a useful step
to learning how, and how well, systems
work.

At present, it is unclear which of
the  different  options  for  statistical 
software  publishing  will  assume  major 
roles  in  the  coming  decade.  It  seems 
likely  that  statistical  software  will  be 
published  mostly  by  electronic  means  at 
some point in the not very distant
future.  However,  the  precise  mix  of 
delivery  methods  remains  to  be  revealed 
by the passage of time.

Abstract


Consider a network of priority queues, and suppose
one is interested in describing some characteristic,
say $f(\lambda)$, of a particular response time distribution
as a function of the arrival rate $\lambda$. Here, $f(\lambda)$
might be a moment or quantile of the response time
distribution, or any of a number of other interesting
functions of the arrival rate. In this paper, a
technique for estimating $f(\lambda)$ as a function of $\lambda$ over
some region of interest is presented. The technique
involves estimation of $f(\lambda)$ at a few values of $\lambda$
by discrete event simulation, normalization of the
estimated $f(\lambda)$, regression of a low order polynomial
on the normalized estimated $f(\lambda)$ and the heavy
traffic value, and, finally, a renormalization of the
fitted polynomial.


1.  Introduction 

In the study of complex networks of priority queues
encountered in computer and communication
system modeling, one is often interested in
describing characteristics of some "steady state"
response time distribution as a function of the rate
at which customers arrive at the network. This
paper presents a simulation/heavy traffic
interpolation methodology that is useful for
providing such descriptions. In Section 2, we
describe a general class of queueing network models
for which the interpolation methodology is
applicable. Sections 3, 4, and 5 describe the three
main ingredients of the interpolation method;
namely, simulation, heavy traffic and the
normalization. In Section 6 we will describe the
interpolation technique and in Section 7 we
illustrate the technique with an example.

2.  A  Class  of  Priority  Queueing  Network  Models 

The systems we will consider here are open
networks of priority queues with $K < \infty$ customer
types. A customer type, $y$, is specified by three
vectors of length $L_y$ ($L_y$ is the number of steps in $y$'s
itinerary). The first vector,
$(\mathrm{node}_y(1), \mathrm{node}_y(2), \ldots, \mathrm{node}_y(L_y))$, gives the
sequence of nodes that $y$ visits. The second vector,
$(\mathrm{prio}_y(1), \mathrm{prio}_y(2), \ldots, \mathrm{prio}_y(L_y))$, gives the priority
levels at each step, and the third vector,
$(g_y(1), g_y(2), \ldots, g_y(L_y))$, gives the service time
distributions for each step.

Note  that  random  routes  are  allowed  as  long  as 
there  is  a  bound  on  the  potential  length  of  the  route 
(i.e.  a  finite  number  of  possibilities).  This  finite 
restriction  can  be  relaxed,  but  the  notation  becomes 
burdensome. Complicated routing schemes, such
as  "nested  Markov  routing",  are  described  in  Simon 
[1985]. 

The queueing discipline at each node in the
network is preemptive resume. Type $y$ customers
enter the system as a Poisson process with rate $\lambda_y$,
and the arrival streams of the $K$ customer types are
mutually independent. Note that the number of
nodes in the network and the number of priority
levels at each node are arbitrary, and are given
implicitly by the vectors $\mathrm{node}_y$ and $\mathrm{prio}_y$,
$y = 1, 2, \ldots, K$.
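
The three-vector description of a customer type maps naturally onto a small record. The sketch below is one possible Python representation, not something taken from the paper: the class name `CustomerType`, its field names, and the example values are all hypothetical, and service time distributions are represented simply as zero-argument sampling functions.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CustomerType:
    """One customer type: three parallel vectors of equal length."""
    arrival_rate: float                          # Poisson arrival rate
    nodes: List[str]                             # node visited at each step
    priorities: List[int]                        # priority level at each step
    service_samplers: List[Callable[[], float]]  # service time draw per step

    def __post_init__(self):
        if not (len(self.nodes) == len(self.priorities)
                == len(self.service_samplers)):
            raise ValueError("the three vectors must have equal length")

    @property
    def itinerary_length(self) -> int:
        """Number of steps in the itinerary (L in the text)."""
        return len(self.nodes)


# Hypothetical two-step type: a fixed service at one node, then an
# exponentially distributed service at another.
job = CustomerType(
    arrival_rate=0.01,
    nodes=["cpu", "disk"],
    priorities=[2, 1],
    service_samplers=[lambda: 5.0, lambda: random.expovariate(1 / 20.0)],
)
print(job.itinerary_length)
```

Keeping the three vectors in one record makes the implicit structure (number of nodes, number of priority levels) recoverable by scanning all the types, as the text notes.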

3.  Discrete  Event  Simulation 

In  the  study  of  complex  networks  of  priority  queues 
encountered  in  practice,  discrete  event  simulation  is 
a  useful  tool  for  providing  reliable  descriptions  of 
response  time  characteristics  of  interest  if  care  is 
taken  in  the  design  and  implementation  of  the 
simulation  experiment  and  analysis  of  the 
simulation output (see Fishman [1978] and Iglehart
and Shedler [1980], for example).

For some system, arrival rate $\lambda_j$ ($0 < \lambda_j < c$)
and response time characteristic $f(\lambda_j)$ of interest,
suppose we obtain a point estimate $\hat f(\lambda_j)$ of $f(\lambda_j)$
via a simulation experiment. Since $\hat f(\lambda_j)$ depends
on the particular observation of the system from the
simulation run, it is often important that some
assessment of the accuracy of the point estimate
$\hat f(\lambda_j)$ be obtained. There are several methods
available to the experimenter for assessing the
statistical precision of the estimates of response
time characteristics based on simulation output data
(see Welch [1983] and Heidelberger and
Lavenberg [1984]). For example, the regenerative
simulation method (when applicable) provides a
technique for approximating the distribution of
$\hat f(\lambda_j)$. In the regenerative method, a cycle is said to
begin when a state of the system is reached such
that future behavior is independent of past behavior
and identical (in distribution) to every other time a
cycle begins. For a fixed number $n$ of cycles, the
regenerative method provides an estimate $\hat f(\lambda_j)$ and
a statistic $se[\hat f(\lambda_j)]$ such that if $n$ is large,

$$\frac{\hat f(\lambda_j) - f(\lambda_j)}{se[\hat f(\lambda_j)]} \sim N(0,1). \qquad (3.1)$$

The $\sim$ in (3.1) means that the random variable (or
statistic) on the left is approximately distributed as
the random variable on the right, and $N(0,1)$
denotes a normal or Gaussian random variable with
zero mean and unit variance. The denominator on
the left in (3.1) is generally referred to as the
estimated standard error of $\hat f(\lambda_j)$. Approximate
confidence statements about the value $f(\lambda_j)$ can be
based on (3.1).
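
The normal approximation in (3.1) leads directly to the usual two-sided confidence interval. The following Python sketch shows that computation using only the standard library; the function name and the numerical inputs are hypothetical, and the interval is only as good as the approximation itself.

```python
from statistics import NormalDist


def normal_confidence_interval(estimate, std_error, level=0.95):
    """Two-sided interval estimate +/- z * std_error, as implied by (3.1)."""
    if std_error < 0:
        raise ValueError("std_error must be non-negative")
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # standard normal quantile
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical point estimate (in milliseconds) and standard error.
low, high = normal_confidence_interval(120.0, 4.0)
print(low, high)
```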

In  Section  6  below,  an  approximation  like  (3.1) 
will  be  an  important  aspect  in  the  development  of 
the  interpolation  methodology. 

Note  that  a  major  drawback  of  pure  simulation 
methodology is, of course, the computational costs
associated  with  a  detailed  study  of  the  system  under 
investigation.  This  can  be  particularly  true  when  a 
description  of  some  response  time  characteristic  for 
relatively  high  arrival  rates  is  desired  (see 
Blomqvist  [1967]). 

4.  Heavy  Traffic  Theory 

Many  of  the  interesting  performance  measures  of 
our  queueing  systems,  such  as  moments  and 
quantiles  of  response  time  and  queue  length 
distributions, become unbounded as the arrival rate
to the system approaches capacity, $c$. Roughly
speaking,  the  heavy  traffic  theory  of  queues 
quantifies  the  rate  at  which  these  functions 
approach  infinity,  so  that  if  the  functions  are 
properly  normalized,  one  can  obtain  exact  (finite) 
limits  of  the  functions  as  the  arrival  rate  approaches 
capacity. For example, if $W_n(\lambda)$ is the $n$th moment
of the response time distribution of a customer who
requires service once, then

$$\lim_{\lambda \to c} (c - \lambda)^n W_n(\lambda) = \frac{n!}{\gamma^n} \qquad (4.1)$$

where $\gamma$ is a quantity that can be calculated in
terms of the system parameters for a large class of
systems.


In many systems (even fairly complicated systems
such as the example in Section 7), $\gamma$ can be
computed by hand. Systems with nested Markov
routing can be solved exactly via systems of
simultaneous equations (see Simon [1985]), and as a
last resort, in the most general cases (i.e. nested
semi-Markov routing), $\gamma$ can be obtained from a
light traffic simulation. Obtaining $\gamma$ from a
simulation experiment will be addressed in a
forthcoming paper (see Simon and Willie [1986]).

Equation (4.1) is actually a consequence of a more
general result. If $W(t,\lambda)$ is the probability that the
response time of a customer (who requires service
once) is greater than $t$ when the arrival rate is $\lambda$,
then

$$\lim_{\lambda \to c} W\!\left(\frac{t}{c - \lambda}, \lambda\right) = e^{-\gamma t}. \qquad (4.2)$$

Equation (4.2) allows us to compute quantiles: if
$f(\lambda)$ denotes the $p$th quantile of the response time
distribution when the arrival rate is $\lambda$, (4.2) implies
that

$$\lim_{\lambda \to c} (c - \lambda) f(\lambda) = \frac{1}{\gamma} \ln\left[\frac{1}{1-p}\right]. \qquad (4.3)$$
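
The right-hand sides of (4.1) and (4.3) are simple to evaluate once $\gamma$ is known. The Python sketch below computes both limits and, as an assumption that goes slightly beyond the text, uses (4.3) as a crude approximation of the quantile itself for arrival rates close to capacity. All function names and example values are hypothetical.

```python
import math


def moment_limit(n, gamma):
    """Right-hand side of (4.1): limit of (c - lam)^n * W_n(lam)."""
    return math.factorial(n) / gamma ** n


def quantile_limit(p, gamma):
    """Right-hand side of (4.3): limit of (c - lam) * f(lam)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return math.log(1.0 / (1.0 - p)) / gamma


def heavy_traffic_quantile(p, lam, c, gamma):
    """Crude approximation f(lam) ~ limit / (c - lam) for lam near c."""
    if not 0.0 <= lam < c:
        raise ValueError("lam must be in [0, c)")
    return quantile_limit(p, gamma) / (c - lam)


# Hypothetical values: gamma = 2, median (p = 0.5), lam close to c = 1.
print(moment_limit(2, 2.0))
print(quantile_limit(0.5, 2.0))
print(heavy_traffic_quantile(0.5, 0.95, 1.0, 2.0))
```

As a consistency check, both limits describe an exponential distribution with rate $\gamma$: its $n$th moment is $n!/\gamma^n$ and its $p$th quantile is $\ln[1/(1-p)]/\gamma$.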


Both  (4.1)  and  (4.2)  can  be  generalized  to 
customer  types  that  require  service  more  than  once, 
or  require  service  a  random  number  of  times  (e.g. 
queues  with  feedback).  The  analog  to  (4.2)  is  a 
weighted  sum  of  exponentials,  and  (4.1)  becomes 
the $n$th moment of that distribution. Although the
analog of (4.3) cannot be written down in closed
form in the general case, the heavy traffic limits of
the  quantile  functions  can  be  easily  computed 
numerically. 

It  should  be  pointed  out  that  there  remain  some 
unresolved  issues  associated  with  a  rigorous 
derivation  of  (4.2).  Equation  (4.2)  assumes  that 
the  stationary  distribution  of  the  limiting  queueing 
process (reflected Brownian motion) is the limiting
stationary distribution of the queueing process. This
interchange of limits has never been demonstrated
rigorously,  although  empirical  (as  well  as  intuitive) 
evidence  seems  to  imply  it  is  true. 

5.  The  Normalization 

From equation (4.1) we see that if we normalize
the function $W_n(\lambda)$ by $(c - \lambda)^n$, it will be finite for
$\lambda$ in the interval $[0, c]$. The same normalization
will keep $Q_n(\lambda)$, the $n$th moment of the queue
length distribution, finite.

A good normalizer will do more than just keep the
function finite, though. Suppose we want to
approximate a function, $f(\lambda)$, which has the form

$$f(\lambda) = \frac{A(\lambda)}{B(\lambda)} = \frac{A(\lambda)}{B_1(\lambda) B_2(\lambda) \cdots B_k(\lambda)}$$

where $A(\lambda)$ and $B_i(\lambda)$, $i = 1, 2, \ldots, k$, are
polynomials (i.e. $f(\lambda)$ is rational). Suppose we
normalize $f(\lambda)$ by $B(\lambda)$ so that we are left to
approximate $g(\lambda) = B(\lambda) f(\lambda) = A(\lambda)$. Since we
approximate $g(\lambda)$ by a polynomial (and since
$g(\lambda)$ is a polynomial), if we have enough
information about $g(\lambda)$ (i.e. a sufficient number of
simulation points, along with the heavy traffic
limit), the only error in the approximation will be
due to uncertainty in the simulation data. For
example, if $A(\lambda)$ is quadratic, and we have two
simulation points and the heavy traffic limit (or one
simulation point and the light and heavy traffic
limits), then $A(\lambda)$ is uniquely determined. Thus,
the ideal normalizer would be the "denominator" of
$f(\lambda)$.
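
A textbook case may help make the argument concrete. It is not one of the paper's examples: for a single-class M/G/1 queue served in order of arrival, the Pollaczek-Khinchine formula gives a mean response time that is rational in $\lambda$ with denominator proportional to $(c - \lambda)$, so the normalized function is a polynomial of degree one. One exact point plus the heavy traffic limit then determine it completely. The Python sketch below checks this numerically; the service time moments and function names are hypothetical.

```python
def mg1_mean_response(lam, es, es2):
    """Pollaczek-Khinchine mean response time, M/G/1 FIFO queue.

    es and es2 are the first and second moments of the service time.
    """
    return es + lam * es2 / (2.0 * (1.0 - lam * es))


def line_through(p0, p1):
    """Return the degree-1 polynomial through two (x, y) points."""
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)


es, es2 = 1.0, 3.0                   # hypothetical service-time moments
c = 1.0 / es                         # capacity of the single server
heavy_limit = es2 / (2.0 * es * es)  # limit of (c - lam) * W(lam) at lam = c

lam0 = 0.3                           # one "simulation" point (exact here)
g0 = (c - lam0) * mg1_mean_response(lam0, es, es2)

g = line_through((lam0, g0), (c, heavy_limit))


def interpolated_response(lam):
    """Renormalize the fitted line back to a response-time curve."""
    return g(lam) / (c - lam)


print(interpolated_response(0.7), mg1_mean_response(0.7, es, es2))
```

With exact inputs the two printed values agree up to rounding, which is the sense in which an ideal normalizer leaves simulation noise as the only source of error.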


Many simple functions of queueing systems are
known to be rational, and one may conjecture that
$W_n(\lambda)$, $Q_n(\lambda)$ and other functions of interest are
rational for very general systems. Unfortunately,
even if we know that $f(\lambda)$ is rational, we may still
have no idea what $B(\lambda)$ is. Our approach is to try
to identify as many of the $B_i(\lambda)$'s as possible, and
use their product as a normalizer. Much of this
work is heuristic, and it remains to be proven that
the terms we identify actually appear in $B(\lambda)$.


First of all, heavy traffic theory shows conclusively
that $B(\lambda)$ contains a term $(c - \lambda)^n$ if $f(\lambda)$ is $W_n(\lambda)$
or $Q_n(\lambda)$ (if $f(\lambda)$ is a quantile function then $B(\lambda)$
contains $(c - \lambda)$). We conjecture that there are
two other classes of terms present in $B(\lambda)$. The first
corresponds to "high priority traffic"; the second is
analogous to $(c - \lambda)$, but is due to non-bottleneck
nodes (see Reiman and Simon [1985] for details).
The optimal choice of a normalizer is an important
and interesting research area.


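The normalizer for response time quantiles will
always have the form $(c - \lambda) n(\lambda)$, where $n(\lambda)$ is
finite for $0 \le \lambda \le c$. Thus (taking $n(\lambda)$ to be
continuous at $c$), we can rewrite (4.2) and (4.3) as

$$\lim_{\lambda \to c} W\!\left(\frac{t}{(c - \lambda)\, n(\lambda)}, \lambda\right) = e^{-\gamma t / n(c)}$$

and

$$\lim_{\lambda \to c} (c - \lambda)\, n(\lambda)\, f(\lambda) = \frac{n(c)}{\gamma} \ln\left[\frac{1}{1-p}\right],$$

respectively.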

6. The Interpolation Methodology


In this section, we will discuss the estimation
methodology associated with characterizing some
"steady state" response time characteristic, say $f(\lambda)$,
as a function of overall arrival rate.


Suppose we have estimates $\hat f(\lambda_j)$ of $f(\lambda_j)$,
$j = 1, \ldots, J$, from a few (independent)
simulation experiments. Suppose, also, that we
have statistics $se[\hat f(\lambda_1)], \ldots, se[\hat f(\lambda_J)]$ such that

$$\frac{\hat f(\lambda_j) - f(\lambda_j)}{se[\hat f(\lambda_j)]} \sim N(0,1), \qquad (6.1)$$

$j = 1, \ldots, J$ (see Section 3). Let $\lambda_{\min}$ denote the
minimum of the arrival rates $\lambda_1, \ldots, \lambda_J$, and let
$n(\lambda)$ denote a normalizer of the type considered in
Section 5. Assume that the heavy traffic limit

$$g(c) = \lim_{\lambda \to c} (c - \lambda) f(\lambda)\, n(\lambda)$$

is known (exactly).

Let us suppose that for $\lambda_{\min} \le \lambda \le c$,

$$g(\lambda) = f(\lambda)(c - \lambda)\, n(\lambda) \approx \sum_{k=0}^{d} b_k \lambda^k = g^{(d)}(\lambda), \qquad (6.2)$$

for some $d$ and coefficients $b_0, \ldots, b_d$. In other
words, we assume that the normalized $f(\lambda)$ can be
approximated by some polynomial of order $d$
over the interval $[\lambda_{\min}, c]$. The problem of
characterizing the normalized $f(\lambda)$ is now one of
determining the order $d$ and coefficients
$b_0, \ldots, b_d$ of the approximating polynomial. For
$j = 1, \ldots, J$, form

$$\hat g(\lambda_j) = \hat f(\lambda_j)(c - \lambda_j)\, n(\lambda_j),$$

and

$$se[\hat g(\lambda_j)] = se[\hat f(\lambda_j)](c - \lambda_j)\, n(\lambda_j).$$

Pretending that $d$ is known, for the moment, a
natural approach to determining the coefficients of
the polynomial is to fit the right hand side of (6.2)
to the statistics $\hat g(\lambda_1), \ldots, \hat g(\lambda_J)$ subject to the
constraint that $g^{(d)}(c) = g(c)$. The approximation
in (6.1) strongly suggests (see, for example, Lewis
and Odell [1971]) that we fit the polynomial using
a constrained, weighted least squares procedure:
Let $\hat{\mathbf g}$ denote the vector of length $J$ with elements
$\hat g(\lambda_1), \ldots, \hat g(\lambda_J)$, and denote by $\mathbf b$ the vector of
length $(d+1)$ with elements $b_0, \ldots, b_d$. Also, let
$V$ denote the $J \times J$ diagonal matrix with diagonal
elements $v_{jj}$ given by

$$v_{jj} = \left(se[\hat g(\lambda_j)]\right)^2,$$

$j = 1, \ldots, J$, and

$$A = \begin{bmatrix}
1 & 1 & \cdots & 1 \\
\lambda_1 & \lambda_2 & \cdots & \lambda_J \\
\lambda_1^2 & \lambda_2^2 & \cdots & \lambda_J^2 \\
\vdots & \vdots & & \vdots \\
\lambda_1^d & \lambda_2^d & \cdots & \lambda_J^d
\end{bmatrix}.$$

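Note that $V$ is an estimate of the covariance matrix
of $\hat{\mathbf g}$. The constrained least squares estimate of the
coefficients vector, $\mathbf b$, is the vector (call it $\hat{\mathbf b}$) which
minimizes

$$(\hat{\mathbf g} - A'\mathbf b)'\, V^{-1} (\hat{\mathbf g} - A'\mathbf b) \qquad (6.3)$$

subject to $\mathbf c'\mathbf b = g(c)$, where $\mathbf c = (1 \;\; c \;\; c^2 \;\; \cdots \;\; c^d)'$
and where $'$ denotes matrix transposition.
Assuming $A$ is of full rank, $J \ge d+1$, and the
$v_{jj}$ are positive, the constrained minimum of (6.3) is at
(see, for example, Section 6.3 of Lewis and
Odell [1971])

$$\hat{\mathbf b} = \tilde{\mathbf b} + S^{-1}\mathbf c\,[\mathbf c' S^{-1} \mathbf c]^{-1}[g(c) - \mathbf c'\tilde{\mathbf b}]$$

where

$$\tilde{\mathbf b} = S^{-1} A V^{-1} \hat{\mathbf g}$$

and

$$S = A V^{-1} A'.$$

Note that $\tilde{\mathbf b}$ is the unconstrained, weighted least
squares estimate of $\mathbf b$. The variance-covariance
matrix of $\hat{\mathbf b}$ is given by

$$\mathrm{cov}(\hat{\mathbf b}) = S^{-1} - S^{-1}\mathbf c\,[\mathbf c' S^{-1} \mathbf c]^{-1}\mathbf c' S^{-1}.$$

Note that the first term on the right directly above
is the variance-covariance matrix of $\tilde{\mathbf b}$, so that the
added information contained in the constraint,
namely $\mathbf c'\mathbf b = g(c)$, leads to a reduction in the
variance of the estimate of $\mathbf b$ over the
unconstrained estimate.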
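
The constrained, weighted least squares step above can be sketched in a few lines of plain Python. This is an illustrative implementation under the stated formulas, not the authors' code: the function names, the small Gaussian elimination helper, and the sample data are all hypothetical, and a production version would use a numerical linear algebra library and guard against ill-conditioning.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def constrained_wls(lams, g_hat, se_g, c, g_c, d):
    """Return (b_hat, cov_b_hat) for a degree-d fit with g_d(c) = g_c."""
    m = d + 1
    w = [1.0 / (s * s) for s in se_g]                  # 1 / v_jj
    # S = A V^{-1} A' and A V^{-1} g_hat, with A[k][j] = lams[j] ** k
    S = [[sum(wj * lj ** (r + s) for wj, lj in zip(w, lams))
          for s in range(m)] for r in range(m)]
    rhs = [sum(wj * lj ** r * gj for wj, lj, gj in zip(w, lams, g_hat))
           for r in range(m)]
    b_tilde = solve(S, rhs)                            # unconstrained fit
    cvec = [c ** k for k in range(m)]
    s_inv_c = solve(S, cvec)                           # S^{-1} c
    denom = sum(ck * sk for ck, sk in zip(cvec, s_inv_c))  # c' S^{-1} c
    gap = g_c - sum(ck * bk for ck, bk in zip(cvec, b_tilde))
    b_hat = [bt + sk * gap / denom for bt, sk in zip(b_tilde, s_inv_c)]
    # Columns of S^{-1}; S is symmetric, so rows and columns coincide.
    s_inv = [solve(S, [1.0 if i == k else 0.0 for i in range(m)])
             for k in range(m)]
    cov = [[s_inv[r][s] - s_inv_c[r] * s_inv_c[s] / denom
            for s in range(m)] for r in range(m)]
    return b_hat, cov


# Hypothetical normalized estimates at four arrival rates, with c = 1.
lams = [0.2, 0.5, 0.8, 0.9]
g_hat = [2.58, 3.22, 3.77, 3.88]
se_g = [0.05, 0.05, 0.05, 0.05]
b_hat, cov = constrained_wls(lams, g_hat, se_g, c=1.0, g_c=4.0, d=2)
print(b_hat)
```

Two properties are worth checking on any output: the fitted polynomial reproduces the heavy traffic limit at $c$, and the quadratic form $\mathbf c'\,\mathrm{cov}(\hat{\mathbf b})\,\mathbf c$ vanishes, reflecting that the value at $c$ is treated as known exactly.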


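Now, for an arbitrary $\lambda$, $\lambda_{\min} \le \lambda \le c$, we
estimate $g^{(d)}(\lambda)$ by

$$\hat g^{(d)}(\lambda) = \sum_{k=0}^{d} \hat b_k \lambda^k = \boldsymbol{\lambda}'\hat{\mathbf b},$$

where $\boldsymbol{\lambda} = (1 \;\; \lambda \;\; \lambda^2 \;\; \cdots \;\; \lambda^d)'$. Renormalizing,
we estimate

$$f^{(d)}(\lambda) = \frac{g^{(d)}(\lambda)}{(c - \lambda)\, n(\lambda)}$$

by

$$\hat f^{(d)}(\lambda) = \frac{\hat g^{(d)}(\lambda)}{(c - \lambda)\, n(\lambda)}.$$

If the approximation given in (6.1) is reasonable,
we expect the approximation

$$\frac{\hat f^{(d)}(\lambda) - f^{(d)}(\lambda)}{se[\hat f^{(d)}(\lambda)]} \sim N(0,1) \qquad (6.4)$$

should also be reasonable, with

$$se[\hat f^{(d)}(\lambda)] = \frac{\sqrt{\boldsymbol{\lambda}'\, \mathrm{cov}(\hat{\mathbf b})\, \boldsymbol{\lambda}}}{(c - \lambda)\, n(\lambda)}$$

(see, for example, Section 3b of Rao [1973]).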
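
The renormalization step and its standard error can be sketched as follows, taking the fitted coefficients and their covariance matrix as given inputs. The function name, the default normalizer $n(\lambda) = 1$, and the numbers in the example are hypothetical; the computation is only meaningful for $\lambda < c$, since it divides by $(c - \lambda)$.

```python
import math


def renormalized_estimate(lam, b_hat, cov_b, c, n=lambda lam: 1.0):
    """Return (f_hat, se_f_hat) at arrival rate lam, with lam < c.

    b_hat holds the polynomial coefficients b_0..b_d and cov_b their
    covariance matrix; n is the extra normalizer factor n(lam).
    """
    if not lam < c:
        raise ValueError("lam must be strictly less than c")
    m = len(b_hat)
    lam_vec = [lam ** k for k in range(m)]            # (1, lam, ..., lam^d)
    g_fit = sum(bk * lk for bk, lk in zip(b_hat, lam_vec))
    quad = sum(lam_vec[r] * cov_b[r][s] * lam_vec[s]
               for r in range(m) for s in range(m))   # lam' cov lam
    scale = (c - lam) * n(lam)
    # max() guards against tiny negative values caused by rounding.
    return g_fit / scale, math.sqrt(max(quad, 0.0)) / scale


# Hypothetical degree-1 fit with uncertainty only in the intercept.
f_hat, se_f = renormalized_estimate(
    1.5, b_hat=[1.0, 0.0], cov_b=[[0.04, 0.0], [0.0, 0.0]], c=2.0)
print(f_hat, se_f)
```

Because the same factor $(c - \lambda)\, n(\lambda)$ divides both the estimate and its standard error, the relative precision of $\hat f^{(d)}(\lambda)$ equals that of the fitted normalized value.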


Above,  we  have  been  assuming  that  a  value  of  d 
which  provides  a  "good"  approximation  in  (6.2)  is 
known.  Note  that,  except  for  a  few  very  simple 
systems,  this  is  generally  not  the  case.  However, 
the  approximations  given  in  (6.1)  and  (6.4)  provide 
a  means  of  partially  checking  the  adequacy  of  the 
constrained  fit  of  a  polynomial  of  degree  d  to  the 
normalized  quantile  estimates  and  the  heavy  traffic 
limit  point.  The  problem  of  determining  the  order 
of  the  polynomial  can  therefore  be  approached 
empirically.  For  example,  a  sensible  procedure  is 
to  choose  the  order  of  the  approximation  to  be  the 
smallest $d$ such that the fitted polynomial is
"reasonably" close (relative to sampling
fluctuations)  to  the  normalized  quantile  estimates. 

As  noted  in  Section  4,  it  is  not  always  possible  to 
compute $g(c)$ exactly. However, an estimate $\hat g(c)$
of  g(c)  can  generally  be  obtained  from  a  light 
traffic  simulation  experiment.  The  extension  of  the 
above  interpolation  methodology  to  the  case  where 
g(c)  is  estimated  via  simulation  will  be  discussed  in 
a  forthcoming  paper  (see  Simon  and  Willie  [1986]). 


7.  An  Example 

The  queueing  network  considered  in  this  section 
has  two  servers,  a  CPU  and  a  disk.  Four  different 
types  of  customers  arrive  at  the  system  as 
independent  Poisson  processes  with  arrival  rates  in 
set  ratios,  and  there  are  four  levels  of  priorities  at 
the CPU. Here, the overall arrival rate, $\lambda$, is the
sum  of  the  four  arrival  rates  of  the  processes 
corresponding  to  the  four  customer  types. 
Customers  of  type  1  have  the  highest  priority  at  the 
CPU  and  require  4.0  milliseconds  of  CPU  time 
before  departing  from  the  system.  Type  2 
customers  require  a  random  amount  of  service  time 
at  the  CPU  from  a  particular  10-point  distribution: 
The  service  time  is  assigned  the  value  10.0,  20.0, 
30.0,  40.0,  50.0,  60.0,  70.0,  80.0,  90.0,  and  100.0 
milliseconds  with  probability  0.23,  0.23,  0.23, 
0.23  ,  0.03  ,  0.01,  0.01,  0.01,  0.01,  and  0.01, 
respectively.  Type  2  customers  have  second  highest 
priority  at  the  CPU  and  after  receiving  service  at 
the  CPU,  they  depart  from  the  system.  Customers 
of type 3 are similar to those of type 1 except that
type  3  customers  require  only  1.0  milliseconds  of 
CPU  service  and  have  only  a  relative  priority  level 
of  three.  A  customer  of  type  4  first  visits  the  CPU 
upon  arrival  to  the  system.  After  receiving  50.0 
milliseconds  of  service  at  the  CPU,  the  customer 
goes  to  the  disk  and  requires  a  random  amount  of 
service  time  that  is  uniformly  distributed  over  the 
interval  (30.0,70.0].  The  customer  then  returns  to 
the  CPU.  This  sequence  of  steps  is  replicated  a 
total  of  12  times  and  then  the  customer  departs 
from  the  system.  Type  4  customers  have  the  lowest 
priority  at  the  CPU. 

The capacity of the system, c, is easily calculated
to be 0.0820 total arrivals per millisecond. The
arrival rates of customers of type 1, 2, 3, and 4 are
$0.47330\lambda$, $0.20930\lambda$, $0.31120\lambda$, and
$0.00620\lambda$, respectively.
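The capacity figure can be reproduced from the service requirements listed above. The sketch below is one reading of the description, not a calculation given in the paper: it assumes that a type 4 customer makes 13 CPU visits (one per CPU-disk cycle plus the final return to the CPU), an assumption adopted because it matches the reported value of c, and it takes the mean disk service time to be 50.0 milliseconds. All variable names are hypothetical.

```python
# Fraction of arrivals belonging to each customer type.
share = {1: 0.47330, 2: 0.20930, 3: 0.31120, 4: 0.00620}

# Type 2 CPU service: the 10-point distribution from the text.
type2_times = [10.0 * i for i in range(1, 11)]
type2_probs = [0.23] * 4 + [0.03] + [0.01] * 5
mean_type2 = sum(t * p for t, p in zip(type2_times, type2_probs))

CPU_VISITS_TYPE4 = 13   # assumption: 12 cycles plus a final CPU visit
DISK_VISITS_TYPE4 = 12
MEAN_DISK_MS = 50.0     # mean of the uniform distribution on (30.0, 70.0]

# Expected milliseconds of work brought to each server by one arrival.
cpu_ms_per_arrival = (share[1] * 4.0 + share[2] * mean_type2
                      + share[3] * 1.0 + share[4] * CPU_VISITS_TYPE4 * 50.0)
disk_ms_per_arrival = share[4] * DISK_VISITS_TYPE4 * MEAN_DISK_MS

# Capacity: the arrival rate at which the server is fully utilized.
cpu_capacity = 1.0 / cpu_ms_per_arrival
disk_capacity = 1.0 / disk_ms_per_arrival
```

Under these assumptions the CPU is the bottleneck, so its capacity is the capacity c of the system, and the disk capacity agrees with the value used for the second normalizing term.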


$$\lim_{\lambda \to c} (c-\lambda)\, f(\lambda) = -\frac{\ln(0.01)}{0.239}.$$


Since  there  are  customers  at  the  CPU  with  priority 
higher  than  customers  of  type  4  and  since  a 
customer  of  type  4  goes  to  the  disk,  we  normalize 
our system with two extra terms,

$$n(\lambda) = (c_H - \lambda)(c_2 - \lambda),$$

where $c_H \approx 0.118$ (the capacity at the CPU as seen
by customers with higher priority than transaction
4), and $c_2 \approx 0.268$ (the capacity at the disk). Thus,
we set

$$g(\lambda) = (c-\lambda)\, f(\lambda)\, n(\lambda).$$


The  heavy  traffic  limit  of  the  fully  normalized 
system  is 


$$\lim_{\lambda \to c} g(\lambda) \approx 0.129.$$
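The value of the fully normalized limit follows from the constants quoted in this section. The short calculation below is a sketch of that arithmetic; the variable names are ours.

```python
import math

c, c_h, c_2 = 0.0820, 0.118, 0.268  # capacities quoted in the text
tail_rate = 0.239                   # exponent of the heavy traffic tail

# 0.99 quantile of an exponential tail with rate 0.239.
limit_cf = -math.log(0.01) / tail_rate

# The two extra normalizing terms evaluated at the capacity c.
n_at_c = (c_h - c) * (c_2 - c)

limit_g = limit_cf * n_at_c
```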


In this example, we are interested in describing the
0.99th quantile of the "steady state" response time
distribution associated with the first two steps of a
type 4 customer (i.e., the first CPU to disk). In
particular, we desire a characterization of this
quantile as a function of overall arrival rate. Here,
$f(\lambda)$ denotes this function.


In  Figure  7.1,  we  present  the  approximation  of  the 
normalized 0.99th quantile of the response time
distribution  of  interest  as  a  function  of  the  overall 
rate  of  arrival  of  customers  to  the  queueing 
network. 

FIGURE  7.1 


For the above network, discrete event simulation
experiments were performed for overall arrival rates
of $\lambda_1 = 0.0410$, $\lambda_2 = 0.0492$, $\lambda_3 = 0.0573$, and
$\lambda_4 = 0.0655$ total arrivals per millisecond. These
arrival rates correspond to traffic intensities of
0.50, 0.60, 0.70 and 0.80, respectively. Based on
the regenerative technique, estimates $\hat{f}(\lambda_j)$ and
$se[\hat{f}(\lambda_j)]$, $j = 1, \ldots, 4$, were constructed from
the appropriate sequences of simulation output
data. The methodology employed was a two-stage
extension of the methodology described in Iglehart
[1976]; see Willie [1986]. Results of another study
(again see Willie [1986]) suggest that for the $\hat{f}(\lambda_j)$
and $se[\hat{f}(\lambda_j)]$ in this example, the approximation
(3.1) is quite reasonable. Alternative estimation
methodologies are developed in Heidelberger and
Lewis [1984].

Computing  the  heavy  traffic  limit  for  our  system  is 
a  straightforward  application  of  the  material  in 
Reiman  [1985],  or  Simon  [1985].  If  W  is  the 
response  time  of  the  first  CPU  to  disk  for  a  type  4 
customer,  then 


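$$P\{W > w/(c-\lambda)\} \approx e^{-0.239\, w}.$$

Thus, we have a heavy traffic limit of $-\ln(0.01)/0.239$ for $(c-\lambda)\, f(\lambda)$.

[Figure 7.1: Approximation of the Normalized 0.99th Quantile Function; horizontal axis: overall arrival rate (arrivals per millisecond)]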
The points

$(\lambda_j, \hat{g}(\lambda_j))$, $j = 1, \ldots, 4$, and $(c, g(c))$

are displayed with solid dots in Figure 7.1. The
vertical bars emanating from the dots extend to
$\hat{g}(\lambda_j) \pm se[\hat{g}(\lambda_j)]$. The character of the points
suggests approximating the normalized quantile
function over $[\lambda_{min}, c]$ by a straight line: The line
in the figure is the (linear) approximation $\hat{g}^{(1)}(\lambda)$,
constructed in the manner described in Section 6.




The vertical bars about the line extend to
$\hat{g}^{(1)}(\lambda_\nu) \pm se[\hat{g}^{(1)}(\lambda_\nu)]$ for a selection of $\lambda_\nu$ in
the interval $[\lambda_{min}, c]$. For a particular arrival rate
$\lambda_\nu$, say, the latter vertical bar is a measure of the
typical error in $\hat{g}^{(1)}(\lambda_\nu)$ resulting from the sampling
errors in the simulation points used in the
interpolation. Note, this should not be confused
with the error in $\hat{g}^{(1)}(\lambda_\nu)$ resulting from
approximating the unknown $g(\lambda_\nu)$ by a linear
function. The point in Figure 7.1 displayed with a
circle is a normalized quantile estimate from an
additional simulation experiment. This point was
not used in the construction of the interpolation
line.

The approximation $\hat{g}^{(1)}(\lambda)$ appears to be a very
reasonable description of $g(\lambda)$ over the entire
interval $[\lambda_{min}, c]$.

Our approximation of the 0.99th quantile function,
$\hat{f}^{(1)}(\lambda)$, is displayed in Figure 7.2. The points and
bars in Figure 7.2 were obtained by renormalizing
the corresponding points and bars in Figure 7.1.

FIGURE 7.2

Approximation of the 0.99th Quantile Function (horizontal axis: arrivals per millisecond)
SEQUENTIAL SIMULATION RUN CONTROL
USING  STANDARDIZED  TIME  SERIES 


Lee  Schruben,  Cornell  University 


In this paper the basic concepts of standardized time series analysis are
presented. A detailed sequential confidence interval estimation procedure is
presented based on standardized time series.


1. MOTIVATION AND BACKGROUND:

The motivational material presented in this
section is expanded on in [Schruben, 1985].
Standardizing a time series is similar to the
familiar procedure of standardizing or
normalizing a scalar statistic. Standardizing a
scalar statistic, such as a sample mean, involves
centering the statistic to have a zero mean and
scaling its magnitude to generic units of
measurement called standard deviations. Limit
theorems can be applied that give us the
asymptotic (large sample) probabilistic behavior
of correctly standardized statistics under
certain hypotheses. This limiting model for
scalar statistics is typically the Standard Normal
probability distribution. This model can be used
for statistical inference such as testing
hypotheses or constructing confidence intervals.
Here we extend this concept to the
standardization of an entire time series.

The  value  of  standardizing  time  series  comes 
from the fact that the same mathematical analysis
can be applied to series from a variety of
sources. Thus the technique of standardization
serves as a mathematical surrogate for experience
with the data under study. No matter what the
original time series looks like, the standardized
time series will be familiar if certain
hypotheses  are  correct.  Unusual  appearance  of  a 
standardized  time  series  can  be  used  to  conclude 
that  these  hypotheses  are  not  valid.  The 
statistical  significance  of  these  conclusions  can 
be  computed  in  the  same  manner  as  with 
standardized  scalar  statistics. 

1.1 Standardizing a Scalar Statistic:

As a guide to standardizing a time series, we
review the procedure of standardizing a scalar
statistic. We will use the familiar t-statistic
as an example. The data will consist of n
observations, $Y_1, Y_2, \ldots, Y_n$, that are
independent and have identical distributions. We
wish to make inferences about the unknown
population mean $\mu$. The sample average of the
data, $\bar{Y}_n$, will be the statistic used for these
inferences. The population variance, $\sigma^2$, is an
unknown nuisance parameter.


Standardization involves the following steps.

(STEP 1) CENTER THE STATISTIC: the population
mean is subtracted from the sample mean giving
the random variable, $\bar{Y}_n - \mu$, which has an expected
value of zero.

(STEP 2) SCALE THE STATISTIC MAGNITUDE: express
the statistic in a common unit of measurement
called a standard deviation. The magnitude of
the statistic is scaled by dividing by $\sigma/\sqrt{n}$. Our
statistic is now

$$Z_n = (\bar{Y}_n - \mu)/(\sigma/\sqrt{n})$$

which is our standardized statistic.

Standardized sample means will all have the same
first two moments. The unknown scaling
parameter, $\sigma$, can either be estimated or it can
be cancelled out of a ratio statistic. The paper
in the references by Glynn and Iglehart discusses
these two alternatives from a theoretical
viewpoint. The cancelling out of this parameter
in a ratio statistic is the more common approach
and this is followed here.

(STEP 3) CANCEL THE SCALE PARAMETER: the data is
aggregated or batched into b exclusive adjacent
groups of size m (assume b = n/m). The
average of each batch is denoted as $\bar{Y}_{i,m}$, i =
1, 2, ..., b. The usual unbiased estimator of the
variance of the batched means, scaled by the batch
size m so that it estimates $\sigma^2$, is

$$S^2 = m\,(b-1)^{-1} \sum_{i=1}^{b} (\bar{Y}_{i,m} - \bar{Y}_n)^2.$$

Inferences about the parameter, $\mu$, are based on
the random ratio,

$$T_{b-1} = \frac{(\bar{Y}_n - \mu)/(\sigma/\sqrt{n})}{\sqrt{((b-1)S^2)/(\sigma^2 (b-1))}} = (\bar{Y}_n - \mu)/(S/\sqrt{n}).$$

The parameter, $\sigma$, cancels out of this ratio.

(STEP 4) APPLY LIMIT THEOREMS: The limiting
distribution of $T_{b-1}$ is known. As $n \to \infty$ (making
$m \to \infty$ since b is fixed) the distribution function
of $(b-1)S^2/\sigma^2$ converges to that of a $\chi^2$ random
variable with b-1 degrees of freedom. Also as n
$\to \infty$, $\bar{Y}_n$ will converge to the constant $\mu$ and the
distribution function of $Z_n$ will converge to that
of a standard Normal random variable. Thus the
distribution function of $T_{b-1}$ (being a continuous
mapping) will converge to that of a t random
variable with b-1 degrees of freedom.

(STEP 5) USE THE LIMITING PROBABILITY MODEL FOR
INFERENCE: The limiting distribution of $T_{b-1}$ can
be used for statistical inference and
estimation.
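The five steps can be sketched in a few lines for the batch means case. This is an illustrative sketch, not code from the paper: the function name is hypothetical, the batch-mean variance is multiplied by the batch size m so that it estimates the population variance, and the number of observations is assumed to be an exact multiple of b.

```python
import math


def batch_means_t(y, b, mu):
    """Return (grand mean, S^2, t ratio on b-1 degrees of freedom)."""
    n = len(y)
    m = n // b
    if m * b != n:
        raise ValueError("n must be a multiple of the number of batches")
    grand = sum(y) / n
    means = [sum(y[i * m:(i + 1) * m]) / m for i in range(b)]
    # m times the sample variance of the batch means estimates sigma^2.
    s2 = m * sum((ym - grand) ** 2 for ym in means) / (b - 1)
    # Centered mean divided by the estimated standard error S / sqrt(n).
    t = (grand - mu) / math.sqrt(s2 / n)
    return grand, s2, t


# Small deterministic illustration: 20 observations in b = 5 batches.
y = [float(v) for v in range(1, 21)]
grand, s2, t = batch_means_t(y, 5, 8.5)
```

A confidence interval for the mean then follows by comparing the ratio with a t quantile on b-1 degrees of freedom.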


1.2 Standardizing a Time Series:

The concept of standardization can be applied to
an entire time series. The original series of
observations is transformed into a standardized
series of observations. We will hypothesize (and
test) that the series is stationary. We also
assume that there is some minimal amount of
randomness in the process; however, we do not
assume that the data is independent. The
mathematical assumptions needed are given in
[Schruben, 1983] where it is argued that
simulations on a computer will meet the imposed
restrictions for applicability. Let
$\{Y_i, i = 1, \ldots, m\}$ denote the time series. We
will standardize the sequence of cumulative means
up to and including the kth observation, given
by,

$$\bar{Y}_k = k^{-1} \sum_{i=1}^{k} Y_i.$$

Similar steps to those in standardizing a scalar
statistic are followed in standardizing a time
series. These steps are as follows.

(STEP 1) CENTER THE SERIES: The sequence given by

$$S_m(k) = \bar{Y}_m - \bar{Y}_k$$

will have a mean of zero.

(STEP 2) SCALE THE SERIES MAGNITUDE: The scaling
constant for dependent sequences that we use is
defined as

$$\sigma^2 = \lim_{m \to \infty} m \operatorname{Var}[\bar{Y}_m]$$

which is just the population variance in the
special case of independent identically
distributed data. Magnitude scaling is done by
dividing $S_m(k)$ by $\sqrt{m}\,\sigma/k$. Again the scaling
constant is unknown but will cancel out of our
statistics as before.

Now there is one step required that was not
necessary in the scalar standardization case.
Different time series can be of different lengths
so we must also scale the index of the series.
We define the continuous index, $t = k/m$. Our previous
index is thus given by $k = \lfloor mt \rfloor$. We also add the
starting point $T_m(0) = 0$ so that $0 \le t \le 1$. The
result is that all standardized time series have
indices on the unit interval.

We now have what we will call a standardized time
series given by

$$T_m(t) = \lfloor mt \rfloor\, S_m(\lfloor mt \rfloor) / (\sqrt{m}\,\sigma).$$

(STEP 3) CANCEL THE SCALE PARAMETER: There are
several functions that might be considered for
the denominator of a ratio that cancels $\sigma$ (see
Schruben, 1983). We will consider here only one
such function, the sum (or limiting area) under
the function $T_m(t)$,

$$A = \sigma \sqrt{m} \sum_{k=1}^{m} T_m(k/m) = \sum_{k=1}^{m} k\, S_m(k).$$

(STEP 4) APPLY LIMIT THEOREMS: It is shown in
[Schruben, 1983] that the standardized series,
$T_m(t)$, will converge in probability distribution
to that of a Brownian Bridge stochastic process.
Thus the Brownian Bridge process plays the role
in time series standardization that the normal
random variable played in the scalar
standardization. An important feature of the
standardized series, $T_m(t)$, is that it is
constructed to be asymptotically independent of
the sample mean, $\bar{Y}_m$.

There are several functions of $T_m(t)$ that
will also be asymptotically $\chi^2$ distributed. The
area, A, will have a limiting normal distribution
with zero mean and variance $V = \sigma^2 (m^3 - m)/12$.
Therefore, $A^2/V$ will have a limiting $\chi^2$
distribution with one degree of freedom.

Now consider where each of b independent
replications (or b batches of data) are
standardized in the manner above. We can then
add the resulting $\chi^2$ random variables, $A^2/V$, for
each replication or batch to obtain a $\chi^2$ random
variable with b degrees of freedom. Also each of
the replication or batch means can be treated as
a set of scalar random variables and standardized
giving another $\chi^2$ random variable, $(b-1)S^2/\sigma^2$
(given above). Due to the independence of $T_m(t)$
and the $\bar{Y}_{i,m}$'s, these two $\chi^2$ random variables can
be added giving a $\chi^2$ random variable with 2b-1
degrees of freedom. This can be considered as a
"pooled" estimator of $\sigma^2$ which we will denote as
Q.

(STEP 5) USE THE LIMITING PROBABILITY MODEL FOR
INFERENCE: Exactly like for the scalar case, the
standardized sample mean of all of the
data can be divided by the square root of Q over
2b-1 to form a ratio (independent of the scale
parameter $\sigma$). For large values of m the
distribution of this ratio can be accurately
modeled as having a t distribution with 2b-1
degrees of freedom. The same types of inferences
can be made for the dependent simulation
output series as were applicable in the
independent data case. The resulting "t-variate"
is given by, with Q expressed in the units of $\sigma^2$
so that $Q/(2b-1)$ estimates $\sigma^2$,

$$T_{2b-1} = (\bar{Y}_n - \mu) / \sqrt{Q / ((2b-1)\, n)}.$$

Theoretical properties of confidence intervals
formed using standardized time series are
presented in [Goldsman and Schruben, 1984]. This
paper compares the standardized time series
approach to conventional methods.

2. APPLICATIONS OF STANDARDIZED TIME SERIES

Standardized time series has been implemented in
several simulation analysis packages, most
notably at IBM [Heidelberger and Welch, 1981], at
Bell Labs [Nozari, 198?], and at G.E. [Schruben,
198?]. These packages typically control
initialization bias (see also [Schruben, Singh
and Tierney, 1983] and [Schruben, 198?]) and run
duration as well as produce confidence
intervals. Other applications of standardized
time series have been to selection and ranking
problems [Goldsman, 198?] and simulation model
validation [Chen and Sargent, 1984].

3. AN ALGORITHM FOR SEQUENTIAL SIMULATION RUN
CONTROL

The objective of the simulation run is to
estimate the mean, $\mu$, of the output series.

The algorithm is based on standardized time
series techniques and is implemented as a
procedure or subroutine that can be used with any
simulation model that generates a series of
output observations. This procedure is called
periodically during a simulation run to test if
either of two run termination criteria are met.

The run is stopped when a maximum run length is
reached or an estimator relative precision
criterion is satisfied. The procedure interrupts
the run at various check points, truncates the
output if a significant initialization bias is
detected, computes a confidence interval estimate,
and terminates the run if appropriate.

The test for significant initialization bias
is the weighted sum test for initialization bias
[Schruben, Singh, Tierney, 1983]. Confidence
interval estimates are computed using the
combined classical-area confidence interval
estimator [Schruben, 198?]. The output sequence
is arbitrarily broken into 5 batches which gives
9 degrees of freedom for computing the confidence
interval. This is felt to be a sufficient number
of degrees of freedom [Schmeiser, 1982]. The
algorithm uses the sequential structure in
Heidelberger and Welch (1983).

Run Control Procedure:

The experimenter selects the following input
parameters for run control:

$\alpha$ = confidence coefficient for the
confidence intervals,

$\gamma$ = acceptable estimator relative
precision (i.e., $H/C < \gamma$), and

$n_{max}$ = maximum number of observations in run.

The user inputs $\alpha$, $\gamma$, and $n_{max}$ and initially
sets the run length, n, in terms of $n_{max}$. The run
control procedure is called after a total of n
observations are generated by the simulation
program. The procedure either terminates the run
with an acceptably precise confidence interval
estimate for $\mu$ or updates the run length, n.
If the value of n returned exceeds $n_{max}$ the run
is to be terminated with a message that the
relative precision criterion was not met in the
allocated maximum run duration.

The run control procedure is as follows:


STEP 1: Compute the truncation point, $n^*$.

STEP 2: If [...],
then, set n = 1.5n and RETURN;
otherwise, compute an $\alpha$ level
confidence interval center point, C,
and half width, H, with the truncated
output sequence (discard
observations up to and including the
$n^*$th).

STEP 3: If $H/C < \gamma$,
then, report a confidence interval
with H, C and STOP;
otherwise, set n = 1.5n and RETURN.

Truncation point selection (STEP 1):

Compute the "classical-sum" estimate of $\sigma^2$
using only the last half of the output, and b = 5
batches. Let $Y_j$ again be the jth observation
and let $\bar{Y}_i$ be the sample mean in batch i. The
batch size is m. That is, compute

$$Q = \sum_{i=1}^{b} \left[ 12 A_i^2/(m^3 - m) + m(\bar{Y}_i - \bar{Y})^2 \right]$$

where

$$A_i = \sum_{k=1}^{m} \sum_{j=1}^{k} (\bar{Y}_i - Y_{(i-1)m+j})$$

and

$\bar{Y}$ = the average of all the retained data.

Note that the expected value of $Q/(2b-1)$ is $\sigma^2$.
It is important to implement the computation of A
so that there is no numerical overflow.
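The pooled estimate can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the packages cited above: the area statistic of each batch is taken to be the sum over k of k times (batch mean minus cumulative mean of the first k observations), which is our reading of the standardized time series area, and the function names are hypothetical. The running form below keeps only one cumulative sum per batch.

```python
def area_statistic(batch):
    """Sum over k of k * (batch mean - mean of the first k observations)."""
    m = len(batch)
    mean = sum(batch) / m
    area = 0.0
    cumulative = 0.0
    for k, y in enumerate(batch, start=1):
        cumulative += y
        area += k * mean - cumulative  # equals k * (mean - cumulative / k)
    return area


def pooled_variance_estimate(y, b):
    """Return (Q, Q / (2b - 1)) from b adjacent batches of equal size."""
    m = len(y) // b
    used = y[:b * m]  # drop any leftover observations
    grand = sum(used) / len(used)
    q = 0.0
    for i in range(b):
        batch = used[i * m:(i + 1) * m]
        a = area_statistic(batch)
        batch_mean = sum(batch) / m
        q += 12.0 * a * a / (m ** 3 - m) + m * (batch_mean - grand) ** 2
    return q, q / (2 * b - 1)


# Tiny deterministic check: one batch of three observations.
q_one, est_one = pooled_variance_estimate([1.0, 2.0, 3.0], 1)
# A constant series has no variability at all.
q_const, est_const = pooled_variance_estimate([4.0] * 20, 5)
```

For long runs with large observations, centering the data before accumulating the sums is one way to keep the intermediate quantities small.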

The truncation point is selected using a
recursion that is equivalent to computing the
weighted-sum initialization bias test statistic.
This test statistic is computed starting at the
end of the output sequence and moving toward the
beginning of the run. The output is not
batched. Let $Y_j$ denote the output, indexed in
reverse order from that in which it was
generated. The recursion starts with T(1) = 0
and is as follows:

Ti  J  f  1 »  5  Tfj  dM?)  +1)  o)  f  jYj  +  -;1UM  ^},  .) -1 .  .  .  .  .  n 


$SUM(j) = \sum_{i=1}^{j} Y_i$ = sum of the last j observations.

The truncation point $n^*$ is given by $n^* = j^*$ where
$j^*$ is the smallest index such that for all $j > j^*$,
$T(j)$ lies outside its confidence limit.
That is, truncate the data where $T(j)$ goes
outside its confidence limit and stays
there.

*"•  ni  i  -1-  n» i  n  t  -  r  .,  1  .  • .  i  i  m,,t  i . . n_  vr F p  ■  i 

11  *  - 1  n  -j  t  ti  •  i  r  un<  .i  •  ed  « -at  j-u  i  ■ .  -  qu  -  nc  *  . 

•  ‘  ^I'Ui  u*.ing  I  h>  i  * 1 1  ■  m  •  j  1  1 1  si...  n  ,(| .  Tt 

1  1  •  •  -  m'  i  -1  n-'.  i  n  <  •  r  ..  I  ~*.  1  i  ma  t .  c  i 


d.»V.HK  rChiFK.-. 


1  •  Th--  s  ■  pi^n i.  l  -i I  *.  i  mu  l  a t  i  •  >n  run  •*« .n t  i  « ■  I 
i  >  j  ...  -  a  ■  J  u  i  ■  -  or  n*  •  in  .n-h'-n  /  h.»s  i 
ON SOME STATISTICAL ISSUES IN SIMULATION STUDIES


C.  L.  Mallows  and  V.  N.  Nair,  AT&T  Bell  Laboratories 


1.  Introduction 

We  point  out  some  statistical  techniques  that  have  not 
been  fully  exploited  in  simulation  studies,  and  show  how  they 
work  in  two  very  simple  examples.  The  techniques  we  shall 
discuss  are  data-analysis,  experimental  design,  and  smoothing. 
Many  workers  in  statistics  have  begun  to  move  away  from  the 
use  of  rigid  models  for  data  towards  more  flexible,  data- 
determined  models.  This  goes  along  with  a  more  algorithmic 
view  of  the  process  of  data  analysis.  Often,  in  a  simulation 
study,  the  purpose  is  to  obtain  qualitative  understanding,  and 
not  (at  least  originally),  precise  estimates  of  model 
parameters.  Thus  these  modern  attitudes  become  relevant. 

2.  A  Birth-Death  Process 

For  our  first  example,  consider  a  simple  birth-death 
process,  with  state-dependent  probabilities.  Thus  the  state 
space is the non-negative integers, and for $k = 1, 2, \ldots$

$$P\{X(t+1) = k+1 \mid X(t) = k\} = p_k = 1 - q_k = 1 - P\{X(t+1) = k-1 \mid X(t) = k\}$$

while for $k = 0$ the same is true except that $q_0$ is the
probability that $X(t+1) = 0$, given that $X(t) = 0$. The problem
is to estimate the stationary distribution $\{\pi_k\}$, or its moments
or  quantiles.  We  view  this  toy  problem  as  a  model  of  part  of 
a  larger  system,  so  that  we  would  not  know  the  p’s,  though  of 
course  to  simulate  the  system  we  have  to  choose  values  for 
these  parameters;  in  fact  we  shall  always  choose  to  make 
them  all  equal  to  some  value,  p . 

If we start a simulation run at $X(0) = x_0$ and run for N
steps, we can compute the "histogram" estimate

$$\hat{\pi}_k^{HIST} = N_k / N$$

where $N_k$ is the number of times the simulation visited the
state k. However several other estimates are available. From
the conservation equation

$$\pi_k p_k = \pi_{k+1} q_{k+1}$$

which  simply  says  that  in  the  long  run,  for  every  time  the 
sample  path  goes  up  from  k  to  k  + 1,  there  must  be  a 
compensating  step  from  k+ 1  to  k,  we  have  that  the  stationary 
distribution  is  given  by 


$$\pi_k = \pi_k(p_0, p_1, \ldots) = c\, \frac{p_0 p_1 \cdots p_{k-1}}{q_1 q_2 \cdots q_k}$$

where the normalizing factor c is determined so that $\sum \pi_k = 1$.
Notice that this result will hold even if the transition
probabilities are not constants, provided we interpret $p_k$ as the
average probability of the corresponding transition.

We define the "PHAT" estimate as

$$\hat{\pi}_k^{PHAT} = \pi_k(\hat{p}_0, \hat{p}_1, \ldots)$$

where

$$\hat{p}_k = N_k^+ / N_k$$

and $N_k^+$ is the number of times (out of $N_k$) that the sample
path left k by passing to k+1 (and not to k-1). We have the
simple

Lemma: If $X(N) = X(0)$, $\hat{\pi}^{HIST} = \hat{\pi}^{PHAT}$.

If X(N) does not equal X(0), the two estimates differ
very slightly. Now we introduce some new estimates. If we
know that $p_k$ does not exceed ½, we can replace $\hat{p}_k$ by

$$\hat{p}_k^{TRUNC} = \min(½, \hat{p}_k)$$

and hence get a "truncated" estimate

$$\hat{\pi}_k^{TRUNC} = \pi_k(\hat{p}_0^{TRUNC}, \hat{p}_1^{TRUNC}, \ldots).$$

Also we can smooth the raw $\hat{p}$'s, obtaining

$$\hat{\pi}_k^{SMOOTH} = \pi_k(\hat{p}_0^{SMOOTH}, \hat{p}_1^{SMOOTH}, \ldots).$$

Finally, for calibration we can consider the maximum
likelihood estimate

$$\hat{\pi}_k^{ML} = (1 - \hat{\rho})\, \hat{\rho}^k$$

where

$$\hat{\rho} = \hat{p}/\hat{q} = N^+/N^-.$$

We ran 100 simulations, each of length N = 1000, for
several values of p. For each run, we computed five estimates
of the mean position

$$\mu = p/(1 - 2p) = \sum k \pi_k,$$

namely $\hat{\mu}^{HIST}$, $\hat{\mu}^{TRUNC}$, $\hat{\mu}^{ML}$, and two versions of $\hat{\mu}^{SMOOTH}$, the
first obtained by fitting a logistic regression to the raw $\hat{p}$'s and
the second by fitting local logistic regressions with window
width ½N. Thus for this last estimate, for each k we
determined a window of values of k that included at least
N/4 epochs on each side, unless fewer than that were
available.

Table 1 gives the corresponding means and mean-square
errors for two values of p, namely 0.35 and 0.45.

We see that for p = .35 the maximum likelihood estimate is
by far the best, while all of the adjusted estimators do better
than the crude HIST estimate. For p = .45 the maximum
likelihood estimate has a large mean square error, due to a
few cases where $\hat{p}$ is close to ½, while each of the other
estimators has a much smaller mean square error, while being
considerably biased towards small values of $\mu$. We are
searching for ways (hopefully of general utility) of reducing
the bias of these adjusted estimators.

In  general,  we  suggest  that  a  similar  strategy  based  on 
smoothing  quantities  a  little  beneath  the  surface  of  the  raw 
simulation  output  may  prove  rewarding. 
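The relation between the histogram estimate and the estimate built from the raw transition proportions can be checked on a simulated path. The sketch below is illustrative only: the function names are hypothetical, the path is started at state 0 and extended until it returns there, so that X(N) = X(0), and a constant up-probability p = 0.35 is used as in the experiments described above.

```python
import random
from collections import Counter


def simulate_until_return(p, min_steps, seed):
    """Birth-death path from state 0, run at least min_steps and until back at 0."""
    rng = random.Random(seed)
    path = [0]
    while len(path) - 1 < min_steps or path[-1] != 0:
        x = path[-1]
        if rng.random() < p:
            path.append(x + 1)
        else:
            path.append(max(x - 1, 0))  # at state 0 a "down" step stays at 0
    return path


def hist_and_phat(path):
    """Histogram and conservation-equation estimates of the stationary law.

    Assumes the path starts at 0 and ends where it started.
    """
    n_steps = len(path) - 1
    visits = Counter(path[:-1])  # N_k: epochs spent in state k
    ups = Counter(x for x, y in zip(path, path[1:]) if y == x + 1)
    top = max(visits)
    hist = [visits[k] / n_steps for k in range(top + 1)]
    p_hat = [ups[k] / visits[k] for k in range(top + 1)]
    # pi_{k+1} = pi_k * p_k / q_{k+1}, then normalize to sum to one.
    weights = [1.0]
    for k in range(top):
        weights.append(weights[-1] * p_hat[k] / (1.0 - p_hat[k + 1]))
    total = sum(weights)
    return hist, [w / total for w in weights]


path = simulate_until_return(0.35, 1000, seed=1)
hist, phat = hist_and_phat(path)
largest_gap = max(abs(h - q) for h, q in zip(hist, phat))
```

Because the path is closed, every step up from k is matched by a step down from k+1, and the two estimates agree up to rounding error.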

Once we think of smoothing $\hat{p}_k$ rather than the raw $\hat{\pi}_k$, we
can bias the simulation to make it more efficient for
estimating the $p_k$'s. At an extreme, we can choose to make
N separate one-step runs, starting at k exactly $n_k$ times,
where $\sum n_k = N$. The $n_k$'s are at our choice, and we can
choose them for efficient estimation of $\mu$ or any other
interesting quantity. Notice that this is not simply
importance sampling; here we can choose to put zero weight
at some k's, and estimate the corresponding $p_k$'s by
smoothing.  With  the  usual  formulation  of  importance 
sampling,  such  a  design  would  lead  to  an  infinite  variance.  A 
simple  calculation  gives  the  following 
Theorem: The optimal design for estimating $\mu(\pi) = \sum k \pi_k$,
using the estimator $\hat{\mu} = \sum k \hat{\pi}_k^{PHAT}$, when truly $p_k = p$ for all k,
is

$$\delta_k = \frac{(2k+1)\,\pi_k}{2\mu + 1}.$$

The efficiency gain, relative to $\delta_k = \pi_k$, is

$$\frac{var(\hat{\mu} \mid \delta_k = \pi_k)}{var(\hat{\mu} \mid \delta_k = opt)} = 2 - \frac{1}{(1 + 2\mu)^2}.$$


In  a  real,  complicated  system,  similar  efficiency  gains 
might  be  realized  by  sequential  design.  A  simpler,  more 
general  strategy  is  the  following.  Every  time  the  simulation 
gets  into  an  "interesting''  state  (i.e.  one  that  is  influential  with 
respect  to  the  quantity  of  interest),  we  spawn  several  short 
daughter  runs  and  hence  improve  the  accuracy  of  estimation 
of  the  local  parameters  (the  pk's  in  our  toy  problem.)  Notice 
that  this  is  not  simply  the  classical  "splitting"  strategy.  We  do 
not  assign  fractional  weight  to  the  daughter  runs,  but  use  each 
realized  step  with  equal  weight  in  estimating  the  local 
parameters. 

3.  A  Processor  Sharing  Queue 

For the second example, consider a M/M/1-PS queue
with Poisson arrivals at rate $\rho$ and exponential service with
mean one. Under this processor sharing queue discipline, if
there are k jobs in the system, each job receives service at
rate 1/k. The problem is to obtain the equilibrium distribution
of the sojourn time W. Notice from Little's law that the
expected value of W under the processor sharing discipline
equals that under the FIFO discipline, and is given by
$1/(1-\rho)$. Coffman et al. (1970) derived the Laplace
transform of the distribution conditioned on its required
service time. From this, it is possible to obtain the variance of
W. More recently, Morrison (1985) has obtained the
following expression for the distribution of W.

P(W>t)- 


We simulated 100 values of $W_k$ for values of $k$ ranging
from 0 to 20 and for $\rho$ from .1 to .9. We simulated the
process with $\rho = .9$ and thinned it to get the $W_k$'s for the other
$\rho$ values. This was done to induce a high correlation among
the simulation results for different $\rho$'s. A plot of the means of
$W_k$ versus $k$ and a least-squares analysis showed that the
conditional expectation of $W$ given $k$ can be well
approximated by a linear function in $k$. By plotting the least
squares coefficients as a function of $\rho$ and by using the results
for the limiting cases, we arrived at

$E(W_k) = (k+2)/(2-\rho)$,

which, in fact, is the correct answer (Coffman et al., 1970). A
plot of the standard deviations of $W_k$ versus $k$ showed that
a linear approximation is reasonable except, perhaps, for
$k$ near 0. This suggested that it may be better to consider
the distribution of $\log(W_k)$, for which the variance will be
independent of $k$. Analysis of the means and variances of
$\log(W_k)$ showed that the mean is linear in $\log(k+2)$ and the
variance is constant in $k$, except for $k$ near 0.
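The conditional mean $E(W_k) = (k+2)/(2-\rho)$ can be checked with a small simulation. The sketch below is not the authors' simulation design (they thinned a single run at $\rho = .9$); it is a minimal illustration that assumes the standard Markov description of the M/M/1-PS queue: with $n$ jobs present, some job departs at total rate 1 and the departing job is the tagged one with probability $1/n$. The function names `simulate_sojourn` and `mean_sojourn` are hypothetical.

```python
import random


def simulate_sojourn(k, rho, rng):
    """Draw one sojourn time of a tagged job in an M/M/1-PS queue.

    k   : number of jobs already present when the tagged job arrives
    rho : Poisson arrival rate (service requirements are exponential, mean 1)
    """
    n = k + 1  # jobs in the system, including the tagged job
    t = 0.0
    while True:
        # Time to the next event: arrival (rate rho) or a departure (rate 1).
        t += rng.expovariate(1.0 + rho)
        if rng.random() < rho / (1.0 + rho):
            n += 1  # a new job arrives
        elif rng.random() < 1.0 / n:
            return t  # the departing job is the tagged one
        else:
            n -= 1  # some other job departs


def mean_sojourn(k, rho, reps=20000, seed=1):
    """Monte Carlo estimate of E(W_k)."""
    rng = random.Random(seed)
    return sum(simulate_sojourn(k, rho, rng) for _ in range(reps)) / reps


if __name__ == "__main__":
    for k, rho in [(0, 0.1), (2, 0.5), (8, 0.9)]:
        print(k, rho, mean_sojourn(k, rho), (k + 2) / (2 - rho))
```

Each printed row pairs the simulated mean with the closed-form value, so the two columns can be compared directly for any $(k, \rho)$ of interest.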

Recall that for $\rho$ near one and for $k$ large, $W_k$ is
approximately distributed as $kX$ where $X$ is an exponential
random variable. Similarly, for $\rho = 0$ and $k = 0$, $W_0$ has an
exponential distribution. For $\rho = 0$ and $k$ large, the mixture
distribution can be approximated by $(k+1)U$ where $U$ is a
uniform(0,1) random variable. These suggested the following
approximation

$\log(W_k) = \log(a) + \log(b) - X_{b:b}$

where $a$ and $b$ depend on $\rho$ and $k$, and $X_{b:b}$ is the largest
order statistic from $b$ independent exponential random
variables. Table 2 gives the values of $a$ and $b$ for the limiting
cases. By considering these limiting values and equating the
expectation of $W_k$ under the above approximation with its
known value, we obtained

$a = k + 1 + \rho$

and


1  f  exp|-fl[2p''',-(H-p)cosfl]/(l-p)sinfll 
o  ( 1  -p)  ( 1  +exp{-*[ 2 \ph—  ( 1  +p) cost)  1/(1  -p)sin9)) 

x  exp{-(  1  -p2)t/(  1  +p-2p 'icos0)lsinW9 . 

We now describe an approach that combines data analysis
with prior information about limiting cases to obtain a
reasonable approximation to this distribution.

Let $k$ be the queue length when a tagged job joins the
queue. We shall examine the conditional distribution of $W$
given $k$, written $W_k$. We can obtain the unconditional distribution easily
from this since the queue length distribution is geometric with
parameter $\rho$ (the same as under the FIFO queue discipline).
Notice that as $\rho \to 0$, the conditional distribution of $W_k$ tends
to the mixture distribution with density

$f_{W_k}(w) = [\exp(-w) + w\exp(-w) + \cdots + w^k\exp(-w)/k!]/(k+1).$

This follows since for small $\rho$ there are essentially no arrivals,
and with probability $1/(k+1)$ the tagged job will be the $j$th
one to receive service and so the sojourn time will be the sum
of $j$ exponential random variables, $j = 1, \ldots, k+1$. For $\rho \to 1$, it
can be shown, and a heuristic argument can be used to
convince the reader, that $W_k/k$ tends in distribution to an
exponential random variable.


$b = (k+2)/((k+\rho)(1-\rho))$.

Figure 1 shows the quantile-quantile (Q-Q) plots of 100
simulated $W_k$'s against the quantiles from the approximating
distributions, for $k = 0$, 2 and 8 and $\rho = .1$, .5 and .9.
We see that all the plots are
approximately linear with slope 1 and intercept 0. There is a
slight nonlinearity in the lower tail, especially for $k = 0$.
However, the approximation seems reasonable overall, and
more extensive plots for other values of $\rho$ and $k$ confirmed
this finding. We can now use this approximation to the
distribution of $W_k$ to easily determine quantities of interest
such as quantiles, which would be much harder to obtain using
the expression given by Morrison (1985).
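As an illustration of how quantiles follow from the approximation, the sketch below rests on two assumptions that are not spelled out in the text. First, for non-integer $b$ the "largest order statistic" $X_{b:b}$ is taken to have distribution function $(1 - e^{-x})^b$. Second, $b = (k+2)/((k+\rho)(1-\rho))$, which is the value obtained by matching the mean $ab/(b+1)$ of the approximation to $(k+2)/(2-\rho)$ with $a = k+1+\rho$. Under these assumptions $P(W_k \le w) = 1 - (1 - w/(ab))^b$ on $0 \le w \le ab$. The function names are hypothetical.

```python
def approx_params(k, rho):
    """Parameters (a, b) of the approximation; requires k + rho > 0, rho < 1."""
    a = k + 1 + rho
    b = (k + 2) / ((k + rho) * (1 - rho))
    return a, b


def approx_quantile(p, k, rho):
    """p-th quantile of a * b * exp(-X_{b:b}) under the assumed cdf."""
    a, b = approx_params(k, rho)
    return a * b * (1.0 - (1.0 - p) ** (1.0 / b))


def approx_mean(k, rho):
    """Mean of the approximating distribution, a * b / (b + 1)."""
    a, b = approx_params(k, rho)
    return a * b / (b + 1.0)


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        print(p, approx_quantile(p, k=2, rho=0.5))
```

The case $k = 0$, $\rho = 0$ corresponds to $b \to \infty$ (the exponential limit) and is excluded here to avoid a division by zero.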

4.  Concluding  Remarks 

We  have  demonstrated  the  use  of  some  common  statistical 
techniques  through  two  simple  examples.  We  believe  that 
similar  approaches  hold  promise  in  more  complex  systems. 
For  example,  in  the  processor  sharing  case,  we  are  trying  to 
find  a  tractable  approximation  to  the  joint  distribution  of 
(Wk,l)  where  l  is  the  number  of  jobs  that  are  served  before 
the  tagged  job.  This  would  enable  us  to  incorporate  PS  nodes 
in  complex  networks.  Similarly,  we  view  the  simple  birth- 




References 

Coffman, E. G., Muntz, R. R., and Trotter, H. (1970)
Waiting time distributions for processor-sharing systems, J.
ACM, 17, pp. 123-130.

Morrison, J. A. (1985) Response-time distribution for a
processor-sharing system, SIAM J. Appl. Math., 45, pp. 152-167.


Table 1

                             * HIST   -m.vc   £. SMOOTH   £ SMOOTH   -ML
p = .35      mean             1.170   1.142     1.131       1.150    1.162
(M = 1.167)  m.s.e. x 10^2    5.15    3.24      3.51        4.19     1.72
p = .45      mean             4.07    3.36      3.83        4.07     4.86
(m = 1.45)   m.s.e.           2.27    1.86      1.95        2.43     5.04

death  model  in  Section  2  as  a  node  embedded  in  a  larger 
system.  The  use  of  smoothing  techniques  and  designed 
simulations  can  improve  the  efficiency  of  the  estimators  in 
such  cases. 
Abstract

We describe the various techniques that have been
proposed for constructing non-parametric confidence intervals
using the bootstrap. These include bootstrap pivotal
intervals, percentile and bias-corrected percentile intervals,
and non-parametric profile likelihood intervals.
These methods are small sample improvements over the
usual $\hat\theta \pm z\hat\sigma$ intervals. We discuss them in detail,
outlining the underlying assumptions in each case. Finally,
the various intervals are compared in a small simulation
study.

1.  Introduction. 

Recently, a number of techniques have been proposed
for constructing confidence intervals using the bootstrap
(see Efron 1981, 1985, Schenker 1985, DiCiccio and
Tibshirani 1985, 1986). These techniques are non-parametric
in nature, and are designed to work well over a wide variety
of situations. Because they are based on the bootstrap,
they can be used in situations in which the "parameter"
is an extremely complex functional of the distribution
and an exact analysis would be impossible. In this
paper, we describe and compare these bootstrap methods.

2.  The  Problem  and  Some  Notation. 

We observe $x_1, \ldots, x_n$, assumed to be realizations of
random variables $X_1, \ldots, X_n \sim$ i.i.d. $F$. The distribution $F$
is unknown and the problem is to construct a confidence
interval for the parameter $\theta = \theta(F)$. By a confidence interval,
we mean lower and upper points $L = L(x_1, \ldots, x_n)$
and $U = U(x_1, \ldots, x_n)$ such that $P(L < \theta < U) = 1 - 2\alpha$,
where $P(\cdot)$ denotes probability under the true distribution
$F$. Since the intervals are to be non-parametric, we
would ideally require that this hold for all $F$. We will
confine our discussion to central intervals, i.e. intervals
$(L, U)$ such that $P(\theta < L) = P(\theta > U) = \alpha$. Non-central
intervals can be obtained through obvious modification.

Given $X_1, X_2, \ldots, X_n$ ($X_i$ can be a scalar or vector
random variable), we estimate $\theta$ by $\hat\theta = \theta(\hat F_n)$ where
$\hat F_n$ is the empirical distribution function of $X_1, \ldots, X_n$. The
observed value of $\hat\theta$ is $\hat\theta_{obs} = \theta(F_n)$ where $F_n$ is the
empirical distribution function of $x_1, \ldots, x_n$.


We let $W$ be a random vector with $W_i \ge 0$, $\sum_1^n W_i = 1$,
and $w$ be a realization of $W$. Let $F_n(w)$ be the distribution
putting mass $w_i$ on $x_i$, $i = 1, 2, \ldots, n$. Many of
the techniques will utilize "bootstrap sampling", that
is, sampling from $x_1, x_2, \ldots, x_n$ with replacement. This is
equivalent to sampling $W$ from the rescaled multinomial
$\mathrm{Mult}(n, w^0)/n$, where $w^0 = (1/n, 1/n, \ldots, 1/n)$. We'll use
$*$ to indicate bootstrap sampling, and a bootstrap value
obtained in this way will be denoted by $\hat\theta^* = \theta(F_n(w^*))$.
We'll refer to a bootstrap sample either by its weight vector
$w^*$, or by $X^* = (X_1^*, X_2^*, \ldots, X_n^*)$. Finally, $B$ will denote
the empirical distribution function of $\hat\theta^*$ under $*$ ("the
bootstrap distribution").
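The equivalence between resampling with replacement and drawing a rescaled multinomial weight vector can be made concrete with a short sketch. The helper names `bootstrap_weights` and `weighted_mean` are hypothetical, and the mean is used only because it is the simplest functional $\theta(F_n(w)) = \sum_i w_i x_i$.

```python
import random


def bootstrap_weights(n, rng):
    """Draw w* ~ Mult(n, (1/n, ..., 1/n)) / n by resampling n indices."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return [c / n for c in counts]


def weighted_mean(x, w):
    """theta(F_n(w)) for the mean functional: mass w[i] on x[i]."""
    return sum(wi * xi for wi, xi in zip(w, x))


if __name__ == "__main__":
    rng = random.Random(0)
    x = [2.1, 3.4, 1.9, 5.0, 4.2]
    w_star = bootstrap_weights(len(x), rng)
    print(w_star, weighted_mean(x, w_star))
```

Passing the weight vector $w^0 = (1/n, \ldots, 1/n)$ to `weighted_mean` recovers the observed value $\hat\theta_{obs}$.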

3.  Overview. 

Frequentist confidence intervals are usually based
on a test function, say $t(X, \theta_1)$, appropriate for testing
$H: \theta = \theta_1$. The interval is constructed as follows. For
each trial value $\theta_1$, we include $\theta_1$ in our confidence interval
if we would accept $H$ in a test of size $2\alpha$ based on
$t(X, \theta_1)$. This procedure requires knowledge of the
distribution of $t(X, \theta_1)$ for each $\theta_1$. Usually, a simplifying
assumption is made: that $t(X, \theta_1)$ is pivotal, that is, has
a distribution not depending on $\theta_1$. With this assumption,
it is not necessary to consider each trial value $\theta_1$
separately. We assume some parametric distribution for
$t(X, \theta_1)$, then invert the pivotal to yield the confidence
interval. A simple example is $X_1, X_2, \ldots, X_n \sim N(\theta, 1)$.
Then a confidence interval for $\theta$ is found by inverting the
pivotal $\bar X - \theta$, whose distribution is $N(0, 1/n)$.

The Bootstrap Pivotal, Percentile, Bias-Corrected Percentile,
$BC_a$ and $BC_a^0$ intervals (Sections 4 and 5) are
non-parametric analogues of parametric pivotal intervals.
The pivotal distribution is not assumed known; instead it
is estimated non-parametrically using the bootstrap. In
Sections 4 and 5 we provide the "recipes" for constructing
these intervals and outline the underlying assumptions.
In Section 6, we discuss the appropriateness of the
various intervals in a few simple problems.

In Section 7 we describe a different approach to
non-parametric confidence interval construction, through
likelihood methods.

In Section 8 we compare all the intervals in a
numerical example.


4.  Bootstrap  Pivotal  Intervals. 


4.1.  The  Simple  Pivotal 

We assume that $\hat\theta - \theta$ is a pivotal quantity, that is

$\hat\theta - \theta \sim H$    (A1)

where $H$ is a distribution not involving $\theta$, and also that
approximately

$\hat\theta^* - \hat\theta_{obs} \sim H$    (A2)

Assumption (A2) is based on the premise that if $F_n$ is
close to $F$, the bootstrap distribution of $\hat\theta^* - \hat\theta_{obs}$ will
be close to that of $\hat\theta - \theta$, as long as $\theta(\cdot)$ is a reasonably
smooth functional. Of course, if $H$ is a continuous
distribution, then (A2) is at best an approximation, since
the bootstrap distribution is necessarily discrete. The
intervals described in this section and the next section all
use this kind of bootstrap approximation. To simplify
the notation, we will ignore the fact that it is only an
approximation.

Under (A1) and (A2), we have
$1 - 2\alpha = P(H^{-1}(\alpha) < \hat\theta - \theta < H^{-1}(1-\alpha))
= P(\hat\theta - H^{-1}(1-\alpha) < \theta < \hat\theta - H^{-1}(\alpha))$.

Substituting $\hat\theta_{obs}$ for $\hat\theta$ and noting that $H^{-1}(\cdot) =
B^{-1}(\cdot) - \hat\theta_{obs}$, we obtain the Bootstrap Pivotal interval:

$\theta \in (2\hat\theta_{obs} - B^{-1}(1-\alpha),\ 2\hat\theta_{obs} - B^{-1}(\alpha))$    (1)
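Interval (1) needs only the quantiles of the bootstrap distribution $B$. The following is a minimal sketch under a few assumptions of ours: $B^{-1}(p)$ is taken as a simple order statistic of the sorted bootstrap values, the number of bootstrap replications `n_boot` is an arbitrary choice, and the helper names are hypothetical.

```python
import random
import statistics


def bootstrap_values(x, stat, n_boot, rng):
    """Sorted bootstrap replications of stat, resampling x with replacement."""
    n = len(x)
    return sorted(
        stat([x[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )


def quantile(sorted_vals, p):
    """B^{-1}(p): a simple order-statistic quantile of the bootstrap values."""
    idx = min(len(sorted_vals) - 1, max(0, int(p * len(sorted_vals))))
    return sorted_vals[idx]


def pivotal_interval(x, stat, alpha=0.05, n_boot=1000, seed=0):
    """Bootstrap pivotal interval (1) based on the pivot theta_hat - theta."""
    rng = random.Random(seed)
    theta_obs = stat(x)
    boot = bootstrap_values(x, stat, n_boot, rng)
    return (2 * theta_obs - quantile(boot, 1 - alpha),
            2 * theta_obs - quantile(boot, alpha))


if __name__ == "__main__":
    data = [float(i) for i in range(1, 21)]
    print(pivotal_interval(data, statistics.mean))
```

Note that the upper bootstrap quantile determines the lower endpoint and vice versa, which is the feature that distinguishes this interval from the percentile interval of Section 5.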

4.2.  Other  Pivotals 

The bootstrap pivotal interval can be based on an
arbitrary pivotal $t(X, \theta)$, as long as it is monotone in $\theta$.
We assume $t(X, \theta) \sim H$, $t(X^*, \hat\theta_{obs}) \sim H$, where $t(X, \theta)$ is
monotone decreasing in $\theta$. Inverting the pivot as above,
we obtain

$\theta \in (t_2^{-1}(H^{-1}(1-\alpha)),\ t_2^{-1}(H^{-1}(\alpha)))$    (2)

where $t_2^{-1}(\cdot)$ is the inverse of $t(\cdot,\cdot)$ with respect to the second
argument.

The bootstrap pivotal interval is used by Efron (1981)
in the form of a "bootstrap t" and by Schenker (1985),
who calls it the "substitution method". We have
introduced the obvious name "bootstrap pivotal interval"
here.

4.3.  The  Role  of  Nuisance  Parameters 

We can think of an arbitrary distribution $G$ as consisting
of two parts, say $G = (\theta, \lambda)$, where $\theta = \theta(G)$ is the
parameter of interest and $\lambda = \lambda(G)$ is a vector of nuisance
parameters, possibly infinite dimensional. The true
distribution can be written as $F = (\theta_{true}, \lambda_{true})$. With this
decomposition, we can say more clearly the meaning of
the statement "$t(X, \theta) \sim H$, $H$ not involving $\theta$". What
we're really assuming is that $F$ is a member of some family
of distributions $\mathcal{F}$ existing in the space of possible
distributions. The members of $\mathcal{F}$ correspond to different $\theta$
values and are characterized by the property $t(X, \theta) \sim H$.
Because of this pivotal assumption, we don't have to know
the structure of (or estimate) the entire family $\mathcal{F}$. Only
a single member of $\mathcal{F}$ need be estimated. The empirical
distribution function $F_n$ estimates that member (i.e.
$F_n = (\hat\theta_{obs}, \hat\lambda_{obs})$), and from this we obtain the
distribution $H$. By construction, the interval will have correct
coverage for $F \in \mathcal{F}$.

A family like $\mathcal{F}$ also underlies the percentile and
bias-corrected percentile intervals (discussed next).

4.4.  Some  theory 

The work of Singh (1981), Abramovitch and Singh
(1985), Beran (1984) and Hartigan (1986) suggests that a
bootstrap pivotal interval based on the pivot $(\hat\theta - \theta)/SD(\hat\theta)$
will be accurate to $O_p(1/n)$ (under regularity conditions)
for any $\theta$. For $\theta = E(X)$, the obvious estimate for $SD(\hat\theta^*)$
is $[\sum_1^n (x_i^* - \bar x^*)^2/n^2]^{1/2}$ and Singh shows that this leads to an
interval correct to $O_p(1/n)$. Unfortunately, for non-linear
statistics calculation of $SD(\hat\theta^*)$ requires a bootstrap
computation, and thus the entire procedure becomes a "double
bootstrap". At the present time this procedure is too
expensive computationally except for small problems.

5.  Percentile  Intervals. 

5.1.  Uncorrected  Intervals 

Here we assume (A1) and (A2), and further that

$H$ is symmetric around 0    (A3)

In this case, the pivotal interval (1) becomes:

$\theta \in (B^{-1}(\alpha),\ B^{-1}(1-\alpha))$    (3)

Efron calls this the Percentile Interval since it uses the
percentiles of $\hat\theta^*$ as "percentiles" of $\theta$.

5.2.  Generalisation  of  the  Percentile  Interval 

If a symmetric pivotal exists on some other scale,
i.e.

$g(\hat\theta) - g(\theta) \sim H$    (A4)

and

$g(\hat\theta^*) - g(\hat\theta_{obs}) \sim H$    (A5)

with $H$ symmetric around 0 and $g(\cdot)$ an unknown,
monotone increasing function, then as in (3) we get as
an interval for $g(\theta)$:

$g(\theta) \in (G^{-1}(\alpha),\ G^{-1}(1-\alpha))$    (4)

where $G$ is the distribution function of $g(\hat\theta^*)$. Transforming
back to the $\theta$ scale gives

$\theta \in (g^{-1}(G^{-1}(\alpha)),\ g^{-1}(G^{-1}(1-\alpha)))$    (5)

$\theta \in (B^{-1}(\alpha),\ B^{-1}(1-\alpha))$    (6)

which is again the percentile interval. Thus the percentile
interval has the correct coverage if a symmetric pivotal
exists on any scale. Conveniently, we don't have to know
$g(\cdot)$ because the resultant interval doesn't depend on $g(\cdot)$.

There is a simple connection between the bootstrap
pivotal interval based on $\hat\theta - \theta$ and the percentile interval.
Writing $(2\hat\theta_{obs} - B^{-1}(1-\alpha),\ 2\hat\theta_{obs} - B^{-1}(\alpha))$ as
$(\hat\theta_{obs} - [B^{-1}(1-\alpha) - \hat\theta_{obs}],\ \hat\theta_{obs} + [\hat\theta_{obs} - B^{-1}(\alpha)])$,
we see that the
percentile interval is the bootstrap pivotal interval reflected
about the point $\hat\theta_{obs}$.
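The reflection relationship between the two intervals is easy to verify numerically. This sketch assumes the bootstrap values are already available as a sorted list and again uses a simple order-statistic quantile for $B^{-1}$; the function names are hypothetical.

```python
def quantile(sorted_vals, p):
    """B^{-1}(p): a simple order-statistic quantile of the bootstrap values."""
    idx = min(len(sorted_vals) - 1, max(0, int(p * len(sorted_vals))))
    return sorted_vals[idx]


def percentile_interval(boot_sorted, alpha):
    """Percentile interval: (B^{-1}(alpha), B^{-1}(1 - alpha))."""
    return quantile(boot_sorted, alpha), quantile(boot_sorted, 1 - alpha)


def reflect_about(interval, center):
    """Reflect an interval about a point; endpoints swap roles."""
    lo, hi = interval
    return 2 * center - hi, 2 * center - lo


if __name__ == "__main__":
    boot = sorted([3.0, 1.0, 2.0, 5.0, 4.0, 2.5, 3.5, 1.5, 4.5, 2.2])
    theta_obs = 3.0
    perc = percentile_interval(boot, 0.1)
    print("percentile:", perc)
    print("pivotal (reflection):", reflect_about(perc, theta_obs))
```

When the bootstrap distribution is symmetric about $\hat\theta_{obs}$ the reflection leaves the interval unchanged, which is the content of assumption (A3).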

5.3.  Bias-Corrected Percentile Intervals

If the distribution $H$ in (A4) and (A5) is symmetric
around $u \ne 0$, the percentile interval will be biased and
will not have the correct coverage. This would occur as
a result of bias in the estimator $\hat\theta$. It turns out that if
we are willing to assume a parametric form for $H$, then $u$
can be estimated and a corrected interval can be derived.
As was the case for the percentile interval, the corrected
interval will not depend on the transformation $g(\cdot)$.

Since $P(g(\hat\theta^*) < g(\hat\theta_{obs})) = P(\hat\theta^* < \hat\theta_{obs})$, we can use
the latter to estimate the bias. Using this correction, we
then match the distributions of $g(\hat\theta) - g(\theta)$ and
$g(\hat\theta^*) - g(\hat\theta_{obs})$ on the $g(\cdot)$ scale, then transform back to the $\theta$
scale.

5.3.1  Normal Correction.

As an example, suppose we choose $H = N(u, 1)$.
Then

$g(\theta) - g(\hat\theta) \sim N(0, 1) - u$    (7)

$g(\hat\theta^*) - g(\hat\theta_{obs}) \sim N(0, 1) + u$    (8)

We can solve for $u$ by noting that $P(g(\hat\theta^*) < g(\hat\theta_{obs})) =
\Phi(-u) = G(g(\hat\theta_{obs})) = B(\hat\theta_{obs})$, so that $b = u = -\Phi^{-1}(B(\hat\theta_{obs}))$
($\Phi(\cdot)$ denotes the cumulative distribution function of $N(0, 1)$).
Now from (7)

$P(g(\theta) - g(\hat\theta) < g(t) - g(\hat\theta_{obs})) = \Phi(g(t) - g(\hat\theta_{obs}) + b)$    (9)

and from (8) we obtain

$B(t) = P(\hat\theta^* < t) = P(g(\hat\theta^*) < g(t)) = \Phi(g(t) - g(\hat\theta_{obs}) - b)$    (10)

Solving for $g(t) - g(\hat\theta_{obs})$ in (10) and substituting into
(9) we have

$P(g(\theta) - g(\hat\theta) < g(t) - g(\hat\theta_{obs})) = \Phi(\Phi^{-1}(B(t)) + 2b)$    (11)

Finally, to get a $1 - 2\alpha$ confidence interval, we
set the right side of (11) equal to $\alpha$ and $1 - \alpha$, and solve
for $t$ to obtain

$\theta \in (B^{-1}(\Phi(z_\alpha - 2b)),\ B^{-1}(\Phi(z_{1-\alpha} - 2b)))$    (12)

where $z_p$ denotes the $p$th quantile of $\Phi$. Interval (12) is
called the Bias-Corrected Percentile Interval. The parametric
assumption $N(u, 1)$ turns out to be not as restrictive
as it appears. If we instead let $H = N(u, \sigma^2)$, with $\sigma$
unknown, and repeat the above derivation, we get
$u/\sigma = b = -\Phi^{-1}(B(\hat\theta_{obs}))$ and we obtain the same interval
(12).

Note that when $b = 0$, the bias-corrected percentile
interval reduces to the percentile interval. Hence we can
think of the bias-corrected interval as a "fine-tuning" of
the percentile interval.


5.3.2  Other Symmetric Location Scale Families.

In the bias-corrected interval above, we can just as
well assume that $H$ is some other symmetric location-scale
family, say $H(x \mid u, \sigma) = H_0((x - u)/\sigma)$. This gives the
bias-corrected interval

$\theta \in (B^{-1}(H_0(h_\alpha - 2b)),\ B^{-1}(H_0(h_{1-\alpha} - 2b)))$    (13)

where $b = -H_0^{-1}(B(\hat\theta_{obs}))$ and $h_p$ denotes the $p$th quantile
of $H_0$.
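The bias-corrected recipe with the normal choice $H_0 = \Phi$ can be sketched using only the standard library. This is an illustration under our own assumptions, not the authors' code: $B^{-1}$ is a simple order-statistic quantile, $B(\hat\theta_{obs})$ is the fraction of bootstrap values not exceeding $\hat\theta_{obs}$ (clipped away from 0 and 1 so the normal quantile is finite), and the function names are hypothetical.

```python
import random
import statistics
from statistics import NormalDist

_PHI = NormalDist()  # standard normal: cdf is Phi, inv_cdf gives z_p


def bc_levels(b_obs, alpha):
    """Adjusted levels (Phi(z_alpha - 2b), Phi(z_{1-alpha} - 2b)),
    where b = -Phi^{-1}(B(theta_obs)) and b_obs = B(theta_obs)."""
    b = -_PHI.inv_cdf(b_obs)
    return (_PHI.cdf(_PHI.inv_cdf(alpha) - 2 * b),
            _PHI.cdf(_PHI.inv_cdf(1 - alpha) - 2 * b))


def quantile(sorted_vals, p):
    """B^{-1}(p): a simple order-statistic quantile of the bootstrap values."""
    idx = min(len(sorted_vals) - 1, max(0, int(p * len(sorted_vals))))
    return sorted_vals[idx]


def bc_interval(x, stat, alpha=0.05, n_boot=1000, seed=0):
    """Bias-corrected percentile interval with the normal family."""
    rng = random.Random(seed)
    n = len(x)
    theta_obs = stat(x)
    boot = sorted(
        stat([x[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    b_obs = sum(v <= theta_obs for v in boot) / n_boot
    b_obs = min(max(b_obs, 0.5 / n_boot), 1 - 0.5 / n_boot)  # keep in (0, 1)
    p_lo, p_hi = bc_levels(b_obs, alpha)
    return quantile(boot, p_lo), quantile(boot, p_hi)


if __name__ == "__main__":
    data = [float(i) for i in range(1, 21)]
    print(bc_interval(data, statistics.variance))
```

When $B(\hat\theta_{obs}) = 1/2$ the adjusted levels are $(\alpha, 1-\alpha)$ and the result is the ordinary percentile interval.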

A natural question to ask is: how much difference
does the choice of $H_0$ make? Natural candidates to compare
with the normal are symmetric, long tailed distributions.
Benjamini (1983) provides an appealing definition
of long-tailedness. Suppose $F$ and $G$ are both symmetric
about the origin. Then $G$ is said to be stretched (or long
tailed) compared to $F$ if $G^{-1}(p)/F^{-1}(p)$ is an increasing
function of $p$, for $1/2 < p < 1$. This definition reflects
the intuitive meaning of long-tailedness, that the quantiles
of $G$ are "farther out" than those of $F$. Under this
definition, distributions like the $t$, logistic and Cauchy are
stretched with respect to the normal, as we would expect.

Now suppose $H_0$ is stretched with respect to $\Phi$. Assume
$B(\hat\theta_{obs}) = q > .5$, so that $\hat\theta_{obs}$ is biased upward,
and $b = -\Phi^{-1}(B(\hat\theta_{obs})) < 0$. Then the bias correction under
$H_0$ will be in the same direction as the bias correction
under $\Phi$ but will be smaller. The proof of this fact is easily
derived from Benjamini's definition above. Denoting,
as before, the $p$th quantiles of $\Phi$ and $H_0$ by $z_p$ and $h_p$
respectively, we note that $H_0(h_\alpha + 2h_q) > \alpha$. Hence

$\dfrac{\Phi^{-1}(H_0(h_\alpha + 2h_q))}{H_0^{-1}(H_0(h_\alpha + 2h_q))} > \dfrac{\Phi^{-1}(\alpha)}{H_0^{-1}(\alpha)} = \dfrac{z_\alpha}{h_\alpha}$

This implies $\Phi^{-1}(H_0(h_\alpha + 2h_q)) < z_\alpha + 2z_\alpha(h_q/h_\alpha) <
z_\alpha + 2z_q$. Thus $\Phi(z_\alpha + 2z_q) > H_0(h_\alpha + 2h_q) > \alpha$.

A similar argument shows that if $q < .5$, then $\Phi(z_\alpha +
2z_q) < H_0(h_\alpha + 2h_q) < \alpha$, and the corresponding results
hold for the upper quantile. The above proof requires
that $h_\alpha + 2h_q < 0$. This will be the case unless the bias
in $\hat\theta$ is so large that $q$ is near $1 - \alpha$.

The numbers in Table 1 show the amount of bias
correction (that is, $(H_0(h_\alpha + 2h_q),\ H_0(h_{1-\alpha} + 2h_q))$) for
the normal, logistic and the Cauchy distributions, when
$\alpha = .05$.


Table 1

  q      Normal          Logistic        Cauchy
 .40   (.015, .869)    (.023, .884)    (.045, .944)
 .45   (.027, .916)    (.034, .927)    (.050, .950)
 .55   (.084, .973)    (.073, .966)    (.050, .950)
 .60   (.131, .985)    (.106, .977)    (.056, .955)

The choice of a symmetric pivotal distribution appears
to make little difference. The effect of an asymmetric
pivotal distribution, however, can be large, as Example 1
will show.
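The quantity tabulated above, $(H_0(h_\alpha + 2h_q),\ H_0(h_{1-\alpha} + 2h_q))$, can be computed directly from the distribution and quantile functions of each family. The sketch below uses the standard (location 0, scale 1) normal, logistic and Cauchy distributions; the dictionary `FAMILIES` and the function `corrected_levels` are hypothetical names, and small third-decimal differences from the printed table are possible.

```python
import math
from statistics import NormalDist

_norm = NormalDist()

# Each entry: (distribution function H0, quantile function H0^{-1}).
FAMILIES = {
    "normal": (_norm.cdf, _norm.inv_cdf),
    "logistic": (lambda x: 1.0 / (1.0 + math.exp(-x)),
                 lambda p: math.log(p / (1.0 - p))),
    "cauchy": (lambda x: 0.5 + math.atan(x) / math.pi,
               lambda p: math.tan(math.pi * (p - 0.5))),
}


def corrected_levels(q, alpha, family):
    """Return (H0(h_alpha + 2 h_q), H0(h_{1-alpha} + 2 h_q))."""
    cdf, inv = FAMILIES[family]
    shift = 2.0 * inv(q)
    return cdf(inv(alpha) + shift), cdf(inv(1.0 - alpha) + shift)


if __name__ == "__main__":
    for q in (0.40, 0.45, 0.55, 0.60):
        row = [corrected_levels(q, 0.05, f) for f in FAMILIES]
        print(q, [(round(lo, 3), round(hi, 3)) for lo, hi in row])
```

For $q = .5$ every family returns $(\alpha, 1 - \alpha)$, and the longer the tails of $H_0$, the closer the corrected levels stay to those nominal values.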


5.3.3  Another  Justification  for  the  Bias-Corrected 
Interval 

In place of (A4) and (A5), we could assume

$h(\hat\theta - \theta) \sim H$    (A6)

and

$h(\hat\theta^* - \hat\theta_{obs}) \sim H$    (A7)

with $H$ symmetric, and $h$ increasing and anti-symmetric
($h(-z) = -h(z)$). Letting $H$ be a location-scale family,
we again obtain the bias-corrected percentile interval
(13). When $H$ is symmetric around 0, $\hat\theta - \theta$ is symmetric
around 0 and the interval reduces to the percentile
interval.

Finally, we could replace $h(\hat\theta - \theta)$ and $h(\hat\theta^* - \hat\theta_{obs})$ by
$h(\hat\theta/\theta)$ and $h(\hat\theta^*/\hat\theta_{obs})$ respectively, with $h(1/x) = -h(x)$,
and again the bias-corrected interval emerges.

5.4.  The $BC_a$ and $BC_a^0$ intervals

Efron (1985) proposed a further modification of the
Percentile interval called the $BC_a$ interval ("a" for
acceleration). It assumes

$g(\hat\theta) - g(\theta) \sim N(b(1 + a\,g(\theta)),\ (1 + a\,g(\theta))^2)$    (15)

This generalizes the BC interval by introducing the
acceleration constant "a" that allows the variance on the
transformed scale to be non-constant. "a" is estimated
from a formula involving the jackknife values of $\hat\theta$.
Efron proves that the one-parameter version of the $BC_a$
interval is correct up to $O_p(1/n)$ under regularity conditions.

DiCiccio and Tibshirani (1985) studied the $BC_a$ procedure
and provided a method for constructing the transformation
$g(\cdot)$ in (15). The constructed $g(\cdot)$ is a variance
stabilizing transformation followed by a skewness reducing
transformation. Using this $g(\cdot)$, one can construct
a confidence interval, called the $BC_a^0$ interval, without
computing the bootstrap distribution of $\hat\theta^*$; through the
use of an approximation for $b$ due to Efron and T. Hesterberg,
no bootstrap sampling is required, and just $n + 2$
evaluations of the statistic $\hat\theta$ are needed.
6.  Comparison Between the Bootstrap
Pivotal and Percentile Intervals.

The bootstrap pivotal and percentile intervals differ
in their assumptions. In constructing the bootstrap
pivotal interval, we had to specify the exact form of the
pivotal but we assumed nothing about its distribution.
On the other hand, in building the percentile interval,
knowledge of the exact form of the pivotal was not necessary
but we did require that its distribution be symmetric
around 0. For the bias-corrected percentile interval, we
weakened that assumption to one of symmetry around
any point, but we paid a price: it was necessary to specify
a distribution for the pivotal.

Which of these intervals is better depends on the
problem. It is helpful to look at a few simple examples. In
each case, the data are assumed to be Gaussian.

•  The Mean: $\theta = E(X)$, variance known. The bootstrap
pivotal interval based on $\hat\theta - \theta$ and the percentile
interval will give very similar results, and
both will have approximately the right coverage.

•  The Correlation Coefficient: $X = (Y, Z)$ and $\theta =
E(Y - E(Y))(Z - E(Z))/\{E(Y - E(Y))^2 E(Z - E(Z))^2\}^{1/2}$.
The random variable $\tanh^{-1}\hat\theta - \tanh^{-1}\theta$ is approximately
$N(\theta/(2(n-3)),\ 1/(n-3))$. Hence the bootstrap
pivotal interval based on $t(\hat\theta, \theta) = \tanh^{-1}\hat\theta -
\tanh^{-1}\theta$ and the bias-corrected percentile interval
(using the normal family) both should work well.
The uncorrected percentile interval will be biased.

•  The Variance: $\theta = E(X - E(X))^2$. The random
variable $\hat\theta/\theta$ is a scaled $\chi^2_{n-1}$ variable, hence the bootstrap pivotal
based on $t(\hat\theta, \theta) = \log\hat\theta - \log\theta$ will have approximately
the right coverage. The distribution of $\log\chi^2$ is
not symmetric, however, so the percentile intervals
may not work well (see Example 1). It is clear that a
transformation to a symmetric pivotal doesn't exist
here since such a transformation must also remove
the dependence of the variance on $\theta$. A simple delta
method calculation shows that only $g(\theta) = \log\theta$
achieves this.

The above examples represent some of the problems
that are well understood. In most situations, however,
matters are much more difficult. To construct a bootstrap
pivotal interval, we first need to specify a quantity
$t(X, \theta)$ that is approximately pivotal. This alone is a
difficult task unless we know something about the underlying
distribution. Now suppose we are able to specify
a pivotal $t(X, \theta)$. Then if $t(X, \theta) \sim H$ and $t(X^*, \hat\theta_{obs}) \sim H$,
the resulting interval will have the correct coverage.
In some problems, however, the bootstrap distribution
of $t(X^*, \hat\theta_{obs})$ can be a poor approximation to $H$.
One such example is the following. Consider the situation
$X_1, X_2, \ldots, X_n \sim e^{-(1+x)}$ for $x > -1$. The bootstrap
pivotal interval for $\theta = E(X)$ based on $\bar X - \theta$ has poor
coverage because the distribution of $\bar X^* - \bar X_{obs}$ is not a
good approximation to the distribution of $\bar X - \theta$. This
is because the high positive correlation between $\bar X$ and
the sample standard deviation $S$ causes underestimation
of the scale when $\bar x$ is smaller than $\theta$ and overestimation
of the scale when $\bar x$ is greater than $\theta$. Basing the interval
on $(\bar X - \theta)/S$ alleviates this problem and the resultant
interval has good coverage.

7.  Non-parametric profile likelihood intervals.

A different approach to constructing non-parametric
confidence intervals can be developed through the use of
an approximate profile likelihood. We will first review
the profile likelihood, then show how it can be used in
this setting. Suppose the true distribution is a member
of a parametric family of density functions $f_\eta$ where $\eta$
is an unknown $k$-vector of parameters lying in a subset
$\Gamma \subset R^k$. Our interest focuses on a real valued parameter
$\theta = t(\eta)$. Let $l(\eta, y)$ be the log-likelihood of the data.

The profile likelihood for $\theta$ is constructed as follows.
For each $\theta_0$, let $\hat\eta(\theta_0)$ maximize $l(\eta, y)$ subject to $t(\eta) =
\theta_0$, and let $\hat\eta$ be the global maximum likelihood estimator.
Then the profile (log) likelihood is defined by

$pl(\theta) = l(\hat\eta(\theta), y) - l(\hat\eta, y)$    (16)

We will assume that for each $\theta_0$, there is a unique restricted
maximum $\hat\eta(\theta_0)$ and hence $\hat\eta(\theta)$ forms a
one-dimensional curve in $\Gamma$. We will call $\hat\eta(\theta)$ the profile
likelihood family. Now let $R(\theta)$ be the signed square root of
twice the profile log likelihood statistic:

$R(\theta) = \pm[2(pl(\hat\theta) - pl(\theta))]^{1/2}$    (17)

the sign of $R(\theta)$ taken to be the sign of $\hat\theta - \theta$. Let
$b(\eta) = E_\eta R(\theta)$. Then a second order correct confidence
interval can be constructed by treating the pivotal quantity
$R(\theta) - b(\hat\eta)$ as $N(0, 1)$.
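To make the signed root (17) concrete, the sketch below works through a parametric toy case of our own choosing, not one taken from the text: normal data with unknown mean and $\theta$ equal to the variance. Profiling out the mean gives $pl(\theta) = -(n/2)[\log(\theta/\hat\theta) + \hat\theta/\theta - 1]$ with $\hat\theta$ the maximum likelihood variance. For simplicity the interval below treats $R(\theta)$ itself as $N(0,1)$, i.e. it omits the correction $b(\hat\eta)$ and is therefore only the first order version. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def mle_variance(x):
    """Maximum likelihood estimate of the variance (divisor n)."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / n


def profile_loglik(theta, x):
    """pl(theta) for the normal variance, with the mean profiled out."""
    n = len(x)
    ratio = mle_variance(x) / theta
    return -0.5 * n * (-math.log(ratio) + ratio - 1.0)


def signed_root(theta, x):
    """R(theta) of (17); pl(theta_hat) = 0, sign is that of theta_hat - theta."""
    r = math.sqrt(max(0.0, -2.0 * profile_loglik(theta, x)))
    return r if mle_variance(x) >= theta else -r


def _solve(target, lo, hi, x):
    """Bisection for R(theta) = target; R is decreasing in theta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if signed_root(mid, x) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def first_order_interval(x, alpha=0.05):
    """Interval {theta : |R(theta)| <= z_{1-alpha}}, ignoring b(eta_hat)."""
    theta_hat = mle_variance(x)
    z = NormalDist().inv_cdf(1 - alpha)
    lower = _solve(z, theta_hat * 1e-3, theta_hat, x)
    upper = _solve(-z, theta_hat, theta_hat * 1e3, x)
    return lower, upper


if __name__ == "__main__":
    data = [0.3, -1.2, 0.8, 1.9, -0.4, 0.1, -2.1, 1.1, 0.6, -0.7]
    print(mle_variance(data), first_order_interval(data))
```

The resulting interval is asymmetric about $\hat\theta$, with the longer arm on the right, which is the qualitative behaviour expected for a variance.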
Consider now applying this to the non-parametric
problem. In the spirit of the bootstrap, we consider
the family of distributions to be multinomial with $\eta_i =
\log(\mathrm{Prob}(X = x_i))$, the natural parameters. It turns out
that $pl(\theta)$ is very difficult to compute, except for simple
(linear) statistics. Hence we consider an approximation to
$pl(\theta)$. A convenient approach is to construct a linear
approximation $\tilde\eta(\theta)$ to $\hat\eta(\theta)$ at $\hat\eta$, then form the approximate
profile likelihood $\widetilde{pl}(\theta) = l(\tilde\eta(\theta), y)$. The approximate
family $\tilde\eta(\theta)$ is actually Stein's "least favourable family"
(Stein 1956). Given this approximate profile likelihood
(which turns out to be easy to compute), we proceed as
above, forming the pivot $R(\theta) - b(\hat\eta)$ and inverting to find
the confidence interval. One can show that this approximate
procedure still produces a second order correct interval.
Note that computation of $b(\hat\eta)$ in the multinomial
requires bootstrap sampling, analogous to the calculation
of $b$ earlier. This method produces not only a confidence
interval for $\theta$ but also an approximate non-parametric
profile likelihood. For more details on this approach we
refer the reader to DiCiccio and Tibshirani (1986).

8.  An  Example. 

Table 2 illustrates the various confidence procedures
for a familiar problem. The data $x_1, x_2, \ldots, x_n$ are i.i.d.
$N(0, 1)$. The parameter of interest is $\theta = \mathrm{Var}(x_i)$. Level
$1 - 2\alpha$ confidence intervals are to be based on the unbiased
estimate $\hat\theta = \sum_1^n (x_i - \bar x)^2/(n - 1)$. The sample size $n$ was
taken to be 20 and $\alpha = .05$. The exact interval is based on
inverting the pivotal $\hat\theta/\theta$ around its chi-squared ($n - 1$)
distribution. The standard interval is of the form (16)
with $\hat\sigma = \hat\theta(2/n)^{1/2}$ the estimated asymptotic standard
error of $\hat\theta$. The bootstrap pivotal interval is based on
the pivotal $\hat\theta/\theta$. The lower and upper values in Table 2
refer to averages over 300 Monte Carlo simulations of the
intervals. The level column indicates the percentage of
trials in which each interval didn't contain the true value
$\theta = 1$.
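A scaled-down version of this kind of experiment can be sketched for two of the intervals. The code is an illustration only, with several assumptions of ours: fewer bootstrap replications than one would use in practice, a simple order-statistic quantile for $B^{-1}$, and the bootstrap pivotal interval written as $(\hat\theta_{obs}^2/B^{-1}(1-\alpha),\ \hat\theta_{obs}^2/B^{-1}(\alpha))$, which is what inverting the pivot $\hat\theta/\theta$ gives. The numbers it prints depend on the seed and are not expected to reproduce the published table exactly.

```python
import random


def var_unbiased(x):
    """Unbiased sample variance (divisor n - 1)."""
    n = len(x)
    m = sum(x) / n
    return sum((xi - m) ** 2 for xi in x) / (n - 1)


def quantile(sorted_vals, p):
    """B^{-1}(p): a simple order-statistic quantile of the bootstrap values."""
    idx = min(len(sorted_vals) - 1, max(0, int(p * len(sorted_vals))))
    return sorted_vals[idx]


def one_trial(rng, n, alpha, n_boot):
    """Percentile and bootstrap pivotal intervals for one N(0,1) sample."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    theta_obs = var_unbiased(x)
    boot = sorted(
        var_unbiased([x[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    q_lo, q_hi = quantile(boot, alpha), quantile(boot, 1 - alpha)
    return {
        "percentile": (q_lo, q_hi),
        "pivotal": (theta_obs ** 2 / q_hi, theta_obs ** 2 / q_lo),
    }


def compare(trials=300, n=20, alpha=0.05, n_boot=400, seed=0):
    """Average endpoints and percentage of misses of the true variance 1."""
    rng = random.Random(seed)
    sums = {"percentile": [0.0, 0.0, 0], "pivotal": [0.0, 0.0, 0]}
    for _ in range(trials):
        for name, (lo, hi) in one_trial(rng, n, alpha, n_boot).items():
            s = sums[name]
            s[0] += lo
            s[1] += hi
            s[2] += not (lo < 1.0 < hi)
    return {name: (s[0] / trials, s[1] / trials, 100.0 * s[2] / trials)
            for name, s in sums.items()}


if __name__ == "__main__":
    for name, row in compare().items():
        print(name, row)
```

The pivotal interval stretches further to the right than the percentile interval because the lower bootstrap quantile lies further (on the ratio scale) from $\hat\theta_{obs}$ than the upper one.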


Table 2.
Confidence intervals for the variance

Interval             Average left   Average right   Level
Exact                    .630           1.878        10.0
Standard                 .466           1.531        11.0
Bootstrap Pivotal        .670           1.860        15.7
Percentile               .484           1.363        24.3
BC                       .592           1.467        19.3
BC_a                     .617           1.524        19.3
BC_a^0                   .633           1.540        18.7
NP Prof Lik              .615           1.579        18.9

The standard interval overcovers on the left and
undercovers on the right so that the overall coverage is about
right. This illustrates why coverage alone is not a good
way to assess confidence intervals. The bootstrap pivotal
interval does fairly well, while the others display too low
coverage. The percentile interval is especially poor. The
$BC_a$, $BC_a^0$ and non-parametric profile likelihood intervals
capture the asymmetry of the normal interval better
than the percentile interval but still underestimate the
right hand endpoint.


9.  Closing  Remarks. 

We have discussed a number of bootstrap techniques
for constructing confidence intervals. All are potentially
useful as data-analytic tools because they are non-parametric
and can be applied in complex situations. Further work
is needed to evaluate and improve these methods. Our
current research focuses on the non-parametric profile
likelihood interval.
ABSTRACT

Linear and nonlinear exponential family and quasi-likelihood
regression models form a class of models exhibiting a common
structure that invites using one algorithmic framework to
compute parameter estimates and regression diagnostics for all
members in the class. This framework extends our work on
nonlinear least squares; it includes iteratively reweighted
least squares, but also encompasses secant updates for a piece
of the Hessian matrix of the likelihood or quasi-likelihood
function along with adaptive decisions about when to use this
information. The framework also provides much of the machinery
needed to compute “leave one out”-style regression diagnostics.
We describe the framework, discuss some implementation details,
and present some numerical experience.

*Research supported in part by National Science Foundation
Grant DCR-8116778 and Army Research Office Grant
DAAG29-84-K-0207.

1. Introduction

Parametric regression models involve a vector of structural
parameters, $\beta \in \mathbb{R}^p$, and a (possibly empty) vector of
“nuisance” parameters, $\gamma \in \mathbb{R}^q$. Computing a parameter
estimate often reduces to minimizing an objective function of
the form

    $f(\beta,\gamma) = \sum_{i=1}^{n} \rho_i(\eta_i(\beta),\gamma).$    (1.1)

Often $\eta_i$ is a linear form: $\eta_i(\beta) = x_i^T\beta$, but our
approach allows $\eta_i$ to be an arbitrary twice differentiable
function. Similarly, $\rho_i$ can be an arbitrary twice
differentiable function; $\rho_i$ or $\eta_i$ can absorb any relevant
response data $y_i$. (The choice of $\rho_i$ and $\eta_i$ in (1.1) is far
from unique, but we do not have space here to consider the
freedom (1.1) allows.)

2. Examples

Our framework includes both linear and nonlinear versions of
the generalized linear models of Nelder and Wedderburn
[NelW72], Wedderburn’s [Wed74] quasi-likelihood models, and
the extended quasi-likelihood models of Nelder and Pregibon
[NelP86]. Some examples follow.

Least Squares: $q = 0$,
$\rho_i(\eta_i(\beta),\gamma) = \rho_i(\eta_i(\beta)) = (y_i - \eta_i(\beta))^2$, where $\eta_i$ is
often a nonlinear function of $\beta$. (We could include the
variance $\sigma^2$ by taking $q = 1$, $\gamma = \sigma$, and
$\rho_i(\eta_i(\beta),\gamma) = (y_i - \eta_i(\beta))^2/\sigma^2 + \log(\sigma^2)$, but it is
slightly more efficient to estimate $\beta$ and $\sigma$ separately.)

Huber: $q = 1$,

    $\rho_i(\eta,\gamma) = \begin{cases} \dfrac{(y_i-\eta)^2}{2\gamma} + \tau_2\gamma & \text{if } \dfrac{|y_i-\eta|}{\gamma} < \tau_1, \\ \tau_1\left(|y_i-\eta| - \dfrac{\tau_1\gamma}{2}\right) + \tau_2\gamma & \text{otherwise,} \end{cases}$

in which $\tau_1$ and $\tau_2$ are tolerances that must be properly
chosen; $\gamma$ is the scale parameter $\sigma$ in §7.8 of [Hub81]. Many
other robust regression problems (e.g. those described in
[HolW77]) are also covered by our framework.

Poisson (with $\eta = \log(\text{mean})$): $q = 0$,
$\rho_i(\eta) = c_i\exp(\eta) - y_i\eta$, where a total of $y_i$ counts are
observed in $c_i$ replications of the $i$th set of experimental
conditions.

Binomial: $q = 0$,
$\rho_i(\eta) = -y_i\log(\eta) - (c_i - y_i)\log(1 - \eta)$, where there are
$y_i$ successes in $c_i$ tries under the $i$th set of conditions.

Quasi-likelihood with variance $V_\theta(\mu) = \mu^\theta$
($\mu$ = mean) [NelP86]: $q = 2$, $\gamma = (\phi, \theta)^T$,

    $\rho_i(\eta,\gamma) = \tfrac{1}{2}\log(2\pi\phi y_i^\theta) + \dfrac{(y_i^{1-\theta} - \eta^{1-\theta})\,y_i}{\phi(1-\theta)} + \dfrac{\eta^{2-\theta} - y_i^{2-\theta}}{\phi(2-\theta)}.$    (2.1)

(L’Hopital’s rule gives special forms for $\rho_i$ when
$\theta = 1$ or 2.)
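
To make the structure of (1.1) concrete, the following minimal sketch evaluates the objective for the Huber example with a linear $\eta_i(\beta) = x_i^T\beta$. It is an illustration only: the function names and the default values of the tolerances `tau1` and `tau2` are assumptions made here, not values taken from the paper.

```python
def huber_rho(y, eta, gamma, tau1, tau2):
    """Huber-type rho_i(eta, gamma) with scale gamma and tolerances tau1, tau2."""
    r = abs(y - eta)
    if r / gamma < tau1:
        # Quadratic branch for small scaled residuals.
        return (y - eta) ** 2 / (2.0 * gamma) + tau2 * gamma
    # Linear branch for large scaled residuals.
    return tau1 * (r - tau1 * gamma / 2.0) + tau2 * gamma

def objective(beta, gamma, xs, ys, tau1=1.5, tau2=0.5):
    """f(beta, gamma) = sum_i rho_i(eta_i(beta), gamma), eta_i = x_i^T beta.
    The default tolerances are arbitrary illustrative values."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(xj * bj for xj, bj in zip(x, beta))
        total += huber_rho(y, eta, gamma, tau1, tau2)
    return total
```

The two branches agree at $|y_i - \eta| = \tau_1\gamma$, so $\rho_i$ is continuous there; swapping in a different `huber_rho` (for instance the Poisson or binomial form) changes the model without changing `objective`.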


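3. Derivatives

The derivatives of (1.1) with respect to the structural
parameters $\beta$ have a form that often is worth exploiting. It
seems essential to compute the gradient
$\nabla f = \begin{pmatrix} \nabla_\beta f \\ \nabla_\gamma f \end{pmatrix}$ reasonably well. Its structural
piece $\nabla_\beta f$ has the form

    $\nabla_\beta f(\beta,\gamma) = J^T \rho'_\eta(\eta,\gamma),$    (3.1a)

where $J \in \mathbb{R}^{n \times p}$ is the Jacobian matrix

    $J = \dfrac{\partial\eta}{\partial\beta}(\beta)$    (3.1b)

and $\rho'_\eta$ is the vector

    $(\rho'_\eta)_i = (\rho'_\eta(\eta(\beta),\gamma))_i = \dfrac{\partial\rho_i}{\partial\eta}(\eta_i(\beta),\gamma).$    (3.1c)

The structural part $\nabla^2_{\beta\beta} f$ of the Hessian matrix
$\nabla^2 f(\beta,\gamma) = \begin{pmatrix} \nabla^2_{\beta\beta} f & \nabla^2_{\beta\gamma} f \\ \nabla^2_{\gamma\beta} f & \nabla^2_{\gamma\gamma} f \end{pmatrix}$ has the form

    $\nabla^2_{\beta\beta} f(\beta,\gamma) = J^T \langle\rho''(\eta,\gamma)\rangle J + \sum_{i=1}^{n} (\rho'_\eta)_i \nabla^2\eta_i(\beta),$    (3.2a)

where $\langle\rho''\rangle$ is the diagonal matrix

    $\langle\rho''(\eta,\gamma)\rangle = \langle\rho''(\eta(\beta),\gamma)\rangle = \mathrm{diag}\left(\dfrac{\partial^2\rho_1}{\partial\eta^2}, \ldots, \dfrac{\partial^2\rho_n}{\partial\eta^2}\right).$    (3.2b)

Just as in the nonlinear least-squares case [DenGW81], the
information needed to compute the gradient furnishes an
important component of the Hessian, i.e. the $J$ in
$J^T\langle\rho''\rangle J$.

Consider now the derivatives of $f$ with respect to the
nuisance parameters $\gamma$. The gradient components are

    $\nabla_\gamma f(\beta,\gamma) = \sum_{i=1}^{n} \dfrac{\partial\rho_i}{\partial\gamma}(\eta_i(\beta),\gamma).$    (3.3)

The Hessian components are

    $\nabla^2_{\beta\gamma} f = J^T \rho''_{\eta\gamma},$    (3.4a)

where $\rho''_{\eta\gamma}$ is the $n \times q$ matrix whose $i$th row is

    $\dfrac{\partial^2\rho_i}{\partial\eta\,\partial\gamma}(\eta_i(\beta),\gamma) = \left(\dfrac{\partial^2\rho_i}{\partial\eta\,\partial\gamma_1}, \ldots, \dfrac{\partial^2\rho_i}{\partial\eta\,\partial\gamma_q}\right);$

and

    $\nabla^2_{\gamma\gamma} f(\beta,\gamma) = \sum_{i=1}^{n} \dfrac{\partial^2\rho_i}{\partial\gamma^2}(\eta_i(\beta),\gamma) \in \mathbb{R}^{q \times q}.$    (3.4b)

The relevant partial derivatives ($\dfrac{\partial\rho_i}{\partial\gamma}$, $\dfrac{\partial^2\rho_i}{\partial\gamma\,\partial\eta}$,
and $\dfrac{\partial^2\rho_i}{\partial\gamma^2}$) are often easy to compute, and we
assume they are available, so we may compute
$\nabla^2_{\beta\gamma} f$ and $\nabla^2_{\gamma\gamma} f$ directly.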
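
The gradient formula (3.1a) is easy to check numerically. The sketch below does so for the Poisson example of §2 with an assumed two-parameter linear predictor $\eta_i(\beta) = \beta_1 + \beta_2 x_i$; the data layout and function names are illustrative assumptions, and the finite-difference routine is only a check, not part of the method.

```python
import math

def eta(beta, x):
    """Assumed linear predictor eta_i(beta) = beta_1 + beta_2 * x_i."""
    return beta[0] + beta[1] * x

def f(beta, xs, ys, cs):
    """Objective (1.1) with the Poisson rho_i(eta) = c_i exp(eta) - y_i eta."""
    return sum(c * math.exp(eta(beta, x)) - y * eta(beta, x)
               for x, y, c in zip(xs, ys, cs))

def gradient(beta, xs, ys, cs):
    """Structural gradient J^T rho'_eta as in (3.1a)."""
    rho_prime = [c * math.exp(eta(beta, x)) - y
                 for x, y, c in zip(xs, ys, cs)]
    jac = [(1.0, x) for x in xs]  # row i of J is (d eta_i/d beta_1, d eta_i/d beta_2)
    return [sum(row[j] * rp for row, rp in zip(jac, rho_prime))
            for j in range(2)]

def fd_gradient(beta, xs, ys, cs, h=1e-6):
    """Central finite differences, used only to check gradient()."""
    grad = []
    for j in range(2):
        up = list(beta)
        down = list(beta)
        up[j] += h
        down[j] -= h
        grad.append((f(up, xs, ys, cs) - f(down, xs, ys, cs)) / (2.0 * h))
    return grad
```

The same Jacobian rows reappear in the $J^T\langle\rho''\rangle J$ term of (3.2a), which is why computing the gradient supplies most of what that term needs.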

4.  Approximating  the  Mess 

It seems relevant to ask how well the techniques that we
found helpful in [DenGW81] for solving nonlinear least-squares
problems carry over to the more general parameter estimation
problems of concern here. One of the key ideas in [DenGW81] is
use of a secant update to approximate the messy part of
(3.2a), i.e. the sum of little Hessians,
$\sum_{i=1}^{n} (\rho'_\eta)_i \nabla^2\eta_i(\beta)$. [In talks, Schnabel sometimes
calls this “the mess matrix”.] Of course, on some problems
$\eta_i(\beta) = x_i^T\beta$ is linear, in which case the messy sum
vanishes. But we wish to allow $\eta(\beta)$ to be nonlinear. Thus
we are led to considering Hessian approximations

    $H = \begin{pmatrix} H_{\beta\beta} & H_{\beta\gamma} \\ H_{\gamma\beta} & H_{\gamma\gamma} \end{pmatrix} \approx \nabla^2 f$

in which $H_{\beta\beta}$ has the form

    $H_{\beta\beta} = H_{GN} + S.$    (4.1)

Here $H_{GN}$ (the “Gauss-Newton” part of the Hessian) is the
part of $\nabla^2_{\beta\beta} f$ that we can easily compute, and
$S \approx \nabla^2_{\beta\beta} f - H_{GN}$ is a matrix that we update after
taking a step. A straightforward generalization of [DenGW81]
considered in [Gay80] is to use

    $H_{GN} = J^T\langle\rho''\rangle J,$    (4.2)

but below we also consider an alternative based on the
expected value of $J^T\langle\rho''\rangle J$.

In the process of stepping from the current iterate
$(\beta,\gamma)$ to the next iterate
$(\beta_+,\gamma_+) = (\beta,\gamma) + (\Delta\beta,\Delta\gamma)$, we learn
(approximately) how $S$ should look in the step direction
$\Delta\beta$. Thus we determine a vector $\Psi$ such that $S_+$, the
new $S$ matrix, should satisfy

    $S_+\Delta\beta = \Psi.$    (4.3)

Many choices of $\Psi$ are possible (we considered half a dozen
choices, including those in [DenW78], in the work leading to
NL2SOL [DenGW81]), but analogy with that work suggests the
following choice of $\Psi$ when $H_{GN} = J^T\langle\rho''\rangle J$. In this
case we wish to have
$S_+\Delta\beta \approx \sum_i (\rho'_\eta(\eta_+,\gamma_+))_i \nabla^2\eta_i(\beta_+)\Delta\beta$. But
$\nabla^2\eta_i(\beta_+)\Delta\beta \approx \nabla\eta_i(\beta_+) - \nabla\eta_i(\beta)$, and

    $\sum_i (\rho'_\eta(\eta_+,\gamma_+))_i (\nabla\eta_i(\beta_+) - \nabla\eta_i(\beta)) = (J_+ - J)^T \rho'_\eta(\eta(\beta_+),\gamma_+),$

so we are led to the choice

    $\Psi = (J_+ - J)^T \rho'_\eta(\eta(\beta_+),\gamma_+)$    (4.4)

of $\Psi$ in (4.3).
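
A small numerical check shows why (4.4) is a sensible right-hand side for the secant condition (4.3). The sketch below uses a one-parameter least-squares problem with an assumed predictor $\eta_i(\beta) = \exp(\beta x_i)$ and compares $\Psi$ with the exact “mess” term times the step; the model, data layout and function names are illustrative assumptions.

```python
import math

def eta(beta, x):
    return math.exp(beta * x)

def d_eta(beta, x):
    """First derivative of eta_i: entry i of the (n x 1) Jacobian J."""
    return x * math.exp(beta * x)

def d2_eta(beta, x):
    """Second derivative of eta_i: the 'little Hessian' of eta_i."""
    return x * x * math.exp(beta * x)

def rho_prime(y, eta_value):
    """d rho_i / d eta for least squares, rho_i = (y_i - eta)^2."""
    return -2.0 * (y - eta_value)

def secant_rhs(beta, beta_new, xs, ys):
    """Psi = (J_+ - J)^T rho'_eta(eta(beta_+)) as in (4.4), for p = 1."""
    return sum((d_eta(beta_new, x) - d_eta(beta, x))
               * rho_prime(y, eta(beta_new, x))
               for x, y in zip(xs, ys))

def messy_term(beta, xs, ys):
    """Exact sum of rho'_i times the little Hessians of eta_i."""
    return sum(rho_prime(y, eta(beta, x)) * d2_eta(beta, x)
               for x, y in zip(xs, ys))
```

For a small step, `secant_rhs(beta, beta + step, xs, ys)` is close to `messy_term(beta + step, xs, ys) * step`, which is the quantity $S_+\Delta\beta$ is meant to reproduce, and it needs only first derivatives of $\eta$.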

It seems reasonable to use some kind of least-change secant
update [DenS79] to update $S$; the general idea is that in some
sense we should make $S_+ - S$ as small as possible, subject to
(4.3). The specific update suggested by Fletcher and Al-Baali
[FleA85] is the best one we have seen for nonlinear
least-squares problems, and its extension to the present
context is the one used in the computing reported below.


5.  Adaptive  Modeling 

Occasionally it is useful to initialize $S$ by finite
differences, but usually we just start with $S = 0$ (the
$p \times p$ matrix of zeros); we always start with $S = 0$ in the
computing reported below.

In the early iterations, $S$ may contribute little to
computing good steps $(\Delta\beta, \Delta\gamma)$. Moreover, as noted above,
choosing $S = 0$ is appropriate on some problems. Thus it is
useful to adaptively decide whether to include $S$ in the
optimization algorithm’s model of its objective function. We
do this as in [DenGW81]. (We also “size” $S$ as in [DenGW81].)

6. IRLS choice of $H_{GN}$

Under appropriate assumptions, $J^T\langle\rho''\rangle J$ has expected
value $J^T\langle w^{IRLS}\rangle J$, where $w^{IRLS}$ is the weighting vector in
the iteratively reweighted least-squares algorithm suggested
in [NelW72] and [Wed74] (see also §§1.4 and 2.5 of [McCN83]).
Thus we are led to an alternate choice of $H_{GN}$, namely

    $H_{GN} = J^T\langle w^{IRLS}\rangle J.$    (6.1)

Both choices of $H_{GN}$ have the form $H_{GN} = J^T\langle w\rangle J$,
where $w$ might be $\rho''$ or $w^{IRLS}$. Correspondingly, (4.4)
generalizes to

    $\Psi = (J_+ - J)^T \rho'_\eta(\eta(\beta_+),\gamma_+) + J_+^T\langle\rho'' - w\rangle J_+ \Delta\beta.$

7.  Trust-Region  Steps 

Some kind of step-size control is often needed to expand the
region of convergence of a locally convergent iteration. In
optimization algorithms, one often exercises step-size control
by doing an approximate line search: looking at candidate next
iterates on a (straight or curvilinear) search path until an
acceptable one is found. We like using “trust-region”
techniques for this purpose. The general idea is that we have
an objective function $f(a)$ whose behavior near the current
iterate $a$ we approximate by a model function $f^Q(\delta)$, so that
$f(a + \delta) \approx f^Q(\delta)$. (In the present context $f^Q$ is a
quadratic form,

    $f^Q(\delta) = f(a) + \delta^T\nabla f(a) + \tfrac{1}{2}\delta^T H\delta$ with $H \approx \nabla^2 f(a)$,

but people sometimes use other local models, e.g. conic models
[Dav80], [Sor80], [Gra84].) The approximation
$f^Q(\delta) \approx f(a + \delta)$ is generally good only for small
$\|\delta\|$, so we maintain a bound $\xi$ on the set of $\delta$ values
$\{\delta : \|\delta\| \le \xi\}$ for which we deem this approximation
reliable. We choose a trial step $\delta^{trial}$ that approximately
minimizes $f^Q(\delta)$ subject to the constraint $\|\delta\| \le \xi$. If
we use the norm $\|\delta\| := \|D\delta\|_2$, where $D$ is a
positive-definite diagonal matrix and $\|\cdot\|_2$ is the standard
Euclidean norm ($\|x\|_2 = \sqrt{x^T x}$), if $f^Q$ is the quadratic
model shown above, and if $\delta^{trial}$ exactly minimizes $f^Q(\delta)$
subject to $\|\delta\| \le \xi$, then $\delta^{trial}$ satisfies

    $(H + \lambda D^2)\delta^{trial} = -\nabla f$    (7.1)

for some Lagrange multiplier $\lambda \ge 0$ that renders
$H + \lambda D^2$ positive semidefinite. We end up with an iteration
much akin to the Levenberg-Marquardt iteration, except that
$\xi$ controls $\lambda$ rather than vice-versa. If the step thus
computed fails to give good agreement between $f^Q$ and $f$,
then we reduce $\xi$ and try again, thus effectively performing
a curvilinear line search. Otherwise we may accept
$a + \delta^{trial}$ as the next iterate (or may increase $\xi$ and try
again); [DenGW81] explains the specific rules used in the
computing described below. (For more details on matters
related to (7.1), also see [Mor78], [Gay81], [Gay83], [MorS83]
and the books [DenS83], [GilMW81].)
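
The relation between the bound on the step and the multiplier in (7.1) can be illustrated with a two-variable sketch. It assumes $H$ is symmetric positive definite (so the scaled step length decreases as the multiplier grows) and finds the multiplier by plain bisection; the careful methods in the cited references are not reproduced here, and all names are hypothetical.

```python
import math

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]

def trial_step(H, g, d, radius, iters=200):
    """Return (step, lam) with (H + lam * D^2) step = -g and ||D step|| <= radius.
    H: 2x2 symmetric positive definite (assumed); g: gradient;
    d: positive diagonal entries of the scaling matrix D."""
    def step_for(lam):
        a = [[H[0][0] + lam * d[0] ** 2, H[0][1]],
             [H[1][0], H[1][1] + lam * d[1] ** 2]]
        return solve2(a, [-g[0], -g[1]])

    def scaled_norm(s):
        return math.hypot(d[0] * s[0], d[1] * s[1])

    step = step_for(0.0)
    if scaled_norm(step) <= radius:
        return step, 0.0  # the Newton-like step is inside the region
    lo, hi = 0.0, 1.0
    while scaled_norm(step_for(hi)) > radius:
        hi *= 2.0  # bracket a multiplier that gives a short enough step
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if scaled_norm(step_for(mid)) > radius:
            lo = mid
        else:
            hi = mid
    return step_for(hi), hi
```

Shrinking `radius` raises the multiplier and shortens the step, which is the sense in which the bound controls $\lambda$ rather than the other way around.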

Sometimes an automatic choice of the scaling matrix $D$ in
(7.1) is useful, e.g. relating $D_{ii}$ to some norm of the $i$th
column of $J$. In the computing reported below, we considered
both the fixed choice $D = I$ and an adaptive choice in which
$D_{ii}$, $1 \le i \le p$, is based on $|H_{ii}|$, where
$H_{ii} = |(H_{GN})_{ii}| + \max\{S_{ii}, 0\}$ for the structural
parameters, i.e., $1 \le i \le p$, and $H_{ii} = (\nabla^2 f)_{ii}$ for the
nuisance parameters, i.e. $p < i \le p+q$. The update rules for
$D$ are analogous to those in [DenGW81]:

    $D_{ii} := \begin{cases} \max\{0.6\cdot D_{ii},\ |H_{ii}|\} & \text{if } |H_{ii}| > 10^{-6}, \\ \max\{0.6\cdot D_{ii},\ 1\} & \text{otherwise.} \end{cases}$

None of the problems considered below has wild scaling, and
the fixed choice $D = I$ usually worked better for them. See
Table 3 below (§10) for full details.

8.  Regression  Diagnostic  Hooks 

Our current implementation of the algorithm sketched above
provides hooks for “leave-one-out” regression diagnostics. The
idea is to provide a quick indication of which observations
wield the most influence on the parameter estimate. To do
this, once we have found optimal parameter estimates
$(\beta^*,\gamma^*)$, we approximate the Hessian matrix
$\nabla^2 f(\beta^*,\gamma^*)$ by finite differences. Then for each $i$, we
let $f^{(i)}$ denote $f$ with the $i$th observation deleted; we
approximate $\nabla^2_{\beta\beta} f^{(i)}$ by
$\nabla^2_{\beta\beta} f(\beta^*,\gamma^*) - \rho''_i(\beta^*,\gamma^*)\nabla\eta_i\nabla\eta_i^T$, estimate
the parameters $(\beta^{*(i)},\gamma^{*(i)})$ by

    $(\beta^{*(i)},\gamma^{*(i)}) \approx (\beta^*,\gamma^*) - (\nabla^2 f^{(i)})^{-1}\nabla f^{(i)}(\beta^*,\gamma^*),$

and point a finger at the $i$ values for which
$(\beta^{*(i)},\gamma^{*(i)})$ and $(\beta^*,\gamma^*)$ differ sufficiently. This
can give diagnostics analogous to those in [BelKW80], [Pre79],
[Pre81], and [Wel82].
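
The one-step leave-one-out estimate is easiest to see in the simplest possible case. The sketch below assumes a single structural parameter with $\eta_i(\beta) = \beta$ and $\rho_i = (y_i - \eta)^2$, so the fitted value is the sample mean and no nuisance parameters are present; the function names are hypothetical. Because this objective is quadratic, the one-step estimate reproduces the exact deleted mean, whereas in general it is only an approximation.

```python
def fit_mean(ys):
    """Minimizer of sum_i (y_i - beta)^2."""
    return sum(ys) / len(ys)

def one_step_deleted_estimates(ys):
    """One Newton-like step from beta* for each deleted observation."""
    beta = fit_mean(ys)
    grad = sum(-2.0 * (y - beta) for y in ys)  # essentially zero at beta*
    hess = 2.0 * len(ys)                       # sum of rho_i'' with grad(eta_i) = 1
    estimates = []
    for y in ys:
        grad_i = grad - (-2.0 * (y - beta))    # gradient of f with term i removed
        hess_i = hess - 2.0 * 1.0 * 1.0        # Hessian minus rho_i'' * grad(eta_i)^2
        estimates.append(beta - grad_i / hess_i)
    return estimates
```

Observations whose deleted estimate moves far from `fit_mean(ys)` are the ones a diagnostic of this kind would flag.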

9.  Test  Results 

We have run tests with 16 problems that are summarized in
Table 1 and described more fully in §10. For those problems
where either $\eta$ or $\log(\eta)$ is linear, we used the weighted
least-squares calculation shown in [Fro81] to compute the
initial guesses $(\beta^0,\gamma^0)$; otherwise we used the initial
guesses shown in Table 2 (which, if possible, included the
ones from the problem sources).

We used the stopping tests and tolerances described in
[Gay83] (i.e., the stopping tests of [DenGW81] with tolerances
appropriate to the double-precision VAX arithmetic we used).

For seven of the problems $\eta$ is linear
($\eta_i(\beta) = \beta^T x_i$), so (4.2) with $S = 0$ makes
$H = \nabla^2 f$, and we are doing Newton’s method (with step-size
control). These problems required between three and thirteen
iterations (between four and fourteen function and gradient
evaluations).

As mentioned in §7 above, choosing $D = I$ (in (7.1)) usually
worked better than updating $D$. In 30 pairs of runs, $D = I$
was better 19 times, worse five times. The differences were
often minor, but were dramatic for problem mn202, as Table 3
shows.

To see how useful our secant approximation to the messy part
of the Hessian is, we reran the test program with adaptive
modeling turned off, i.e., only using $H := H_{GN}$ as the
Hessian approximation. Since $S = 0$ when $\eta$ is linear and
$H_{GN}$ is given by (4.2), this could affect only 44 of the 60
pairs of runs summarized in Table 3. As Table 3 shows,
adaptively using $S$ helped on 13 of these runs and hurt on
three of them; the degradations caused by using $S$ were minor,
but the improvements were sometimes substantial, e.g. on
problems e2.8 and textile.

On four of our test problems, the $\rho''$ and IRLS choices of
$H_{GN}$, (4.2) and (6.1), respectively, are the same. On the
remaining problems, (6.1) was better than (4.2) on eight,
worse on seven of the runs using $S$ summarized in Table 3.

10.  Test  Problem  Details 

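Table 1 gives details of the test problems we used. In the
following formulae, $c_i$, $n_i$, $x_{ij}$ and $y_i$ denote data (carried
by $\eta_i$ or $\rho_i$), and $x_i := (x_{i,1}, \ldots, x_{i,p})^T$; $c_i$ and $n_i$
sometimes denote replication counts or batch sizes, as in
[Fro84]. The choices of $\eta(\beta)$ include:

linear:

    $\eta_i(\beta) = x_i^T\beta$;    (10.1)

log linear:

    $\eta_i(\beta) = \exp(x_i^T\beta)$;    (10.2)

logistic of linear:

    $\eta_i(\beta) = [\exp(-x_i^T\beta) + 1]^{-1}$;    (10.3)

special forms:

    $\eta_i(\beta) = \beta_1\{\beta_2 x_{i,1} + \dfrac{\cdots}{x_{i,2}}[1 - \beta_3(1 - \exp(-x_{i,2}/\beta_3))]x_{i,1}\}$;    (10.4)

    $\eta_i(\beta) = \{\exp(\beta_2 + \beta_3 x_{i,2}) + \exp(\beta_4)\}\exp(\beta_1 x_{i,1})$;    (10.5)

    $\eta_i(\beta) = \beta_1 x_{i,1}[1 - \exp(\beta_2 x_{i,2})]^{\beta_3}$;    (10.6)

    $\eta_i(\beta) = \beta_1 + \sum_{j=1}^{3} \dfrac{\beta_{2j}}{x_{ij} + \beta_{2j+1}}$;    (10.7)

    $\eta_i(\beta) = \beta_1 + \beta_2\log(x_{i,1}) + \dfrac{\beta_3 x_{i,2}}{\beta_4 + x_{i,2}}$;    (10.8)

    $\eta_i(\beta) = \beta_1 + \beta_2\log(x_{i,1} - \beta_5) + \dfrac{\beta_3 x_{i,2}}{\beta_4 + x_{i,2}}$.    (10.9)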

Choices of $\rho_i(\eta,\gamma)$ include:

Poisson ($\eta = \mu$):

    $\rho_i(\eta,\gamma) = c_i\eta - y_i\log(\eta)$;    (10.10)

Poisson ($\eta = \log(\mu)$):

    $\rho_i(\eta,\gamma) = c_i\exp(\eta) - y_i\eta$;    (10.11)

binomial ($\eta = \mu$):

    $\rho_i(\eta,\gamma) = -y_i\log(\eta) - (n_i - y_i)\log(1 - \eta)$;    (10.12)

binomial (logistic):

    $\rho_i(\eta,\gamma) = n_i\log(1 + e^{\eta}) - y_i\eta$;    (10.13)

binomial (probit):

    $\rho_i(\eta,\gamma) = -y_i\log[\Phi(\eta)] - (n_i - y_i)\log[1 - \Phi(\eta)]$,    (10.14)

where $\Phi$ is the cumulative normal distribution function; and

gamma ($\eta = \mu^{-1}$):

    $\rho_i(\eta,\gamma) = y_i\eta - c_i\log(\eta)$.    (10.15)
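
The binomial forms (10.12) and (10.13) describe the same likelihood in two parametrizations: substituting the logistic probability $[\exp(-\eta) + 1]^{-1}$ into (10.12) gives (10.13). A short numerical check of that identity follows; the function names are illustrative only.

```python
import math

def rho_binomial_mean(y, n, prob):
    """(10.12): eta is the success probability itself."""
    return -y * math.log(prob) - (n - y) * math.log(1.0 - prob)

def rho_binomial_logistic(y, n, eta):
    """(10.13): eta is the logit of the success probability."""
    return n * math.log(1.0 + math.exp(eta)) - y * eta

def logistic(eta):
    """Inverse logit, the 'logistic of linear' transform of (10.3)."""
    return 1.0 / (math.exp(-eta) + 1.0)
```

For any `y`, `n` and `eta`, `rho_binomial_mean(y, n, logistic(eta))` and `rho_binomial_logistic(y, n, eta)` agree up to rounding, so the choice between them is a matter of where the logistic transform is placed, in $\eta$ or in $\rho$.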

Problems mn202 and mn202.1 differ only in their starting
guesses; the same goes for mn205 and mn205.1. For problems
with $\eta_i$ given by (10.1) or (10.2), we computed initial
guesses $\beta^0$ as for Poisson regression problems in [Fro81]:

    $\beta^0 = (J^{0T}\langle c\rangle J^0)^{-1} J^{0T} y^0,$

where $J^0 = J(\beta^0)$ and $y^0_i$ is $y_i$ for (10.1) and
$c_i\log(\max\{\dfrac{y_i}{c_i}, \dfrac{1}{2c_i}\})$ for (10.2). [We do not
explicitly form $J^{0T}\langle c\rangle J^0$.] Table 2 gives the initial
guesses we used for other choices of $\eta_i$.

(The starting guess for e3.3 is the same, to six figures, as
for e3.1.)

Table 3 summarizes our test runs; NF and NG stand for the
number of function and gradient evaluations, respectively.
AN EFFICIENT ALGORITHM FOR
ORTHOGONAL DISTANCE DATA FITTING

Paul  T.  Boggs,  National  Bureau  of  Standards 
Richard  H.  Byrd,  University  of  Colorado 
Robert  B.  Schnabel,  University  of  Colorado 


Abstract. One of the most widely used methodologies in
scientific and engineering research is the fitting of
equations to data by least squares. In cases where significant
observation errors exist in all data (independent) variables,
however, the ordinary least squares approach, where all errors
are attributed to the observation (dependent) variable, is
often inappropriate. An alternate approach, suggested by
several researchers, involves minimizing the sum of squared
orthogonal distances between each data point and the curve
described by the model equation. We refer to this as
orthogonal distance regression (ODR). This paper describes a
method for solving the orthogonal distance regression problem
that is a direct analog of the trust region
Levenberg-Marquardt algorithm. The number of unknowns involved
is the number of model parameters plus the number of data
points, often a very large number. By exploiting sparsity,
however, our algorithm has a computational effort per step
which is of the same order as required for the
Levenberg-Marquardt method for ordinary least squares. We
summarize the theoretical properties of our algorithm, and
provide the results of computational tests that illustrate
some differences between the two approaches.

1.  Introduction 

The problem of fitting a model to data with errors in the
observations has a rich history and a considerable literature.
The problem where there are also errors in the independent
variables at which these observations are made, however, has
only relatively recently been given attention. In this paper,
we consider a general form of this extended problem and
provide an efficient and stable algorithm for its solution.
Several names for this extended problem have been suggested;
we prefer orthogonal distance regression (ODR).

Errors in independent variables virtually always occur, but
are often ignored in order that classical or ordinary (linear
or nonlinear) least squares (OLS) techniques can be applied
(see, e.g., [LawH74], [Ste73], [Mor77], [DenGW81]). Also, if
these errors are small with respect to those in the observed
variables, then ignoring them does not usually seriously
degrade the accuracy of the estimates. In some fields,
however, measurement techniques are sufficiently accurate that
errors in the independent variables are not insignificant
compared to those in the observations. Examples at the
National Bureau of Standards (NBS) include the calibration of
electronic devices, flowmeters and calorimeters. Another class
of examples comes from curve and surface fitting problems.

We first develop a formal statement of the ODR estimation
problem and briefly discuss its application to statistical
estimation and to curve fitting. The derivation and
convergence analysis of a highly efficient algorithm for
solving ODR problems is summarized in Sections 2 and 3. In
Section 4, the results of some computations are shown which
illustrate the performance of the algorithm and allow some
comparisons with ordinary least squares. This material is
presented in greater detail in Boggs, Byrd, and Schnabel
[BogBS85].

Observations in applied science are often thought of as
satisfying a mathematical model of the form

(1.1)    $y = f(x; \beta)$

where $y$ is taken to be the “observed” value, or dependent
variable; $x$ is the independent variable; and $\beta \in R^p$ is the
set of parameters to be estimated. The function $f$ is not
assumed to be linear, but is assumed to be smooth. The data
are simply the pairs $(x_i, y_i)$, $i = 1, \ldots, n$. Typically the
number of data points, $n$, is far greater than the number of
parameters, $p$.

In the classical case, only the observations $y_i$ are assumed
to be contaminated with errors. If these errors are additive
and the mathematical model is exact then

(1.2)    $y_i = f(x_i; \beta) + \epsilon_i, \quad i = 1, \ldots, n$

for some correct value of the parameters $\beta$. If in addition
the errors are normally distributed with mean 0 and variance
$\sigma^2 I$, then the maximum likelihood estimate of $\beta$ is the
solution to the least squares problem

(1.3)    $\min_\beta \sum_{i=1}^{n} [y_i - f(x_i; \beta)]^2.$

If $f$ is a linear function of $\beta$ then this is a classical
linear least squares problem; otherwise it is a classical
nonlinear least squares problem. Even when the above
assumptions on the model or the errors are not satisfied,
problem (1.3) is the most frequently used method for parameter
estimation.

In the more general situation, the measurements of the
independent variables $x_i$ are also assumed to contain errors.
If we assume that $y_i$ has unknown additive error $\epsilon_i$ and that
$x_i$ has unknown additive error $\delta_i$, then (1.2) becomes

(1.4)    $y_i = f(x_i + \delta_i; \beta) + \epsilon_i.$

An intuitively reasonable way to select the parameters in
this case is to choose the $\beta$ that causes the sum of the
squares of the orthogonal distances from the data points
$(x_i, y_i)$ to the curve $f(x; \beta)$ to be minimized. If $r_i$ is the
orthogonal distance from $(x_i, y_i)$ to the curve, then

    $r_i^2 = \epsilon_i^2 + \delta_i^2, \quad i = 1, \ldots, n$

where $\epsilon_i$ and $\delta_i$ solve

(1.5)    $\min_{\epsilon_i, \delta_i} (\epsilon_i^2 + \delta_i^2)$
         subject to $f(x_i + \delta_i; \beta) + \epsilon_i = y_i$.

The constraint in (1.5) ensures that the distance $r_i$
connects the point $(x_i, y_i)$ to the curve. The minimization
ensures that $r_i$ is the radius of the smallest circle centered
at $(x_i, y_i)$ which is tangent to the curve $f(x; \beta)$. Therefore,
the parameters $\beta$ that cause the sum of the squares of the
orthogonal distances from the data points to the curve to be
minimized are found by solving

(1.6)    $\min_{\beta, \epsilon, \delta} \sum_{i=1}^{n} r_i^2 = \min_{\beta, \epsilon, \delta} \sum_{i=1}^{n} (\epsilon_i^2 + \delta_i^2)$
         subject to $y_i = f(x_i + \delta_i; \beta) + \epsilon_i, \quad i = 1, \ldots, n$.

Since the constraints in (1.6) are simple linear constraints
in $\epsilon_i$, we solve for $\epsilon_i$ and eliminate both these variables
and all of the constraints thereby obtaining

(1.7)    $\min_{\beta, \delta} \sum_{i=1}^{n} [(f(x_i + \delta_i; \beta) - y_i)^2 + \delta_i^2]$

which is now an unconstrained minimization problem.

Two slight extensions to this form constitute the ultimate
problem to be considered. The first allows the possibility
that $x_i \in R^m$ rather than $R^1$. Therefore, $\delta_i \in R^m$ and
instead of $\delta_i^2$ in (1.7) we have
$\delta_i^T\delta_i = \sum_{j=1}^{m} \delta_{ij}^2$. (The superscript T denotes
transpose.) The second extension merely admits a general
weighting scheme on the problem. The form we have chosen
results in the general nonlinear ODR problem

(ODR)    $\min_{\beta, \delta} \sum_{i=1}^{n} w_i^2 [(f(x_i + \delta_i; \beta) - y_i)^2 + \delta_i^T D_i^2 \delta_i]$

where $w_i > 0$, $i = 1, \ldots, n$ and

    $D_i = \mathrm{diag}\{d_{ij} > 0,\ j = 1, \ldots, m\}, \quad i = 1, \ldots, n,$

i.e., $D_i$ is a diagonal matrix of order $m$. It follows that
the vectors $y, w \in R^n$ and $x, \delta \in R^{nm}$ and that
$\delta_i^T D_i^2 \delta_i = \sum_{j=1}^{m} \delta_{ij}^2 d_{ij}^2$.

While we have not assumed that $f$ is linear, it is important
to note that (ODR) is a nonlinear optimization problem even if
$f$ is the simple linear function

    $y = \beta_1 x + \beta_2$

since we then have that

    $y_i = \beta_1(x_i + \delta_i) + \beta_2 + \epsilon_i.$

Clearly the product of $\beta_1$ and $\delta_i$ is an unavoidable
nonlinearity.
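
For the straight-line model the inner problem (1.5) can be solved by hand, which makes the term “orthogonal distance” concrete. With $e_i = \beta_1 x_i + \beta_2 - y_i$, setting the derivative of $(e_i + \beta_1\delta_i)^2 + \delta_i^2$ to zero gives $\delta_i = -\beta_1 e_i/(1 + \beta_1^2)$ and $r_i^2 = e_i^2/(1 + \beta_1^2)$, the squared perpendicular distance to the line. The sketch below checks this against a direct one-dimensional search; it assumes unit weights and scalar $x_i$, and the function names are hypothetical.

```python
def closed_form(beta1, beta2, x, y):
    """Return (delta, r^2) solving (1.5) for f(x; beta) = beta1 * x + beta2."""
    e = beta1 * x + beta2 - y
    delta = -beta1 * e / (1.0 + beta1 ** 2)
    return delta, e * e / (1.0 + beta1 ** 2)

def inner_objective(delta, beta1, beta2, x, y):
    """eps^2 + delta^2 with eps eliminated through the constraint in (1.5)."""
    eps = y - (beta1 * (x + delta) + beta2)
    return eps * eps + delta * delta

def numeric_min(beta1, beta2, x, y, lo=-100.0, hi=100.0, iters=200):
    """Ternary search over delta; valid because the objective is convex."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (inner_objective(m1, beta1, beta2, x, y)
                < inner_objective(m2, beta1, beta2, x, y)):
            hi = m2
        else:
            lo = m1
    delta = 0.5 * (lo + hi)
    return delta, inner_objective(delta, beta1, beta2, x, y)
```

Even in this linear case the minimized distance depends on $\beta_1$ through the factor $1/(1 + \beta_1^2)$, so the outer problem in $\beta$ is not a linear least squares problem.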

ODR problems have been considered by statisticians, usually
under the rubric errors in variables. Most of this effort,
however, has been devoted to linear models, i.e., when $f$ is
linear in $\beta$. (See e.g., [Mor71], [KenSO83], [Bar74, p. 67]
and [Ful86].) As in the classical nonlinear least squares
case, little theory on the statistical properties of the
solution appears to exist. It is known that if both $\epsilon$ and
$\delta$ are normally distributed with mean zero and variances
$\sigma_\epsilon^2 I$ and $\sigma_\delta^2 I$ respectively, then the solution of (ODR)
with $w_i = 1$ and $D_i = (\sigma_\epsilon/\sigma_\delta)I$, $i = 1, \ldots, n$ is a
maximum likelihood estimate of the parameters. Unfortunately,
as in the nonlinear classical case, no generally valid,
computationally efficient, inferential statistical tests are
known.

Independent of statistical considerations, ODR has
potentially significant applications in curve and surface
fitting. Consider, for example, the problem of finding the
parabola which best fits the given set of points. (We have
seen this problem arise from a dental application.) Here it is
clear that ordinary least squares will unduly weight the top
data points, while fitting in the horizontal direction would
unduly weight the bottom data points. An orthogonal measure of
distance alleviates these problems and provides a reasonable
fit. A related case is the problem of fitting near an
asymptote. Orthogonal distances here prevent the undue
influence of points close to the asymptote. This problem is
discussed further in Section 4.

The literature contains several algorithms for solving (ODR)
and related problems. For example, Golub and Van Loan
[GolV83] give a singular value decomposition procedure for the
problem when $f$ is linear. They refer to this problem as total
least squares. Britt and Luecke [BriL73] consider the
nonlinear case as well as the nonlinear implicit case and
present an algorithm. Recently, Schwetlick and Tiller [SchT85]
proposed an algorithm similar to the one here for the
nonlinear problem. Our algorithm, however, does not make use
of the singular value decomposition and it does incorporate a
full trust region strategy.

2.  The  Algorithm 

In order to solve the minimization problem (ODR),

(2.1)    $\min_{\beta, \delta} \sum_{i=1}^{n} w_i^2 [(f(x_i + \delta_i; \beta) - y_i)^2 + \delta_i^T D_i^2 \delta_i]$

we first express it in a more convenient form and simplify
the notation. Next, we give an overview of the iteration which
is based on the trust region Levenberg-Marquardt strategy
popularized by Moré [Mor77]. (See also [Heb73], [DenS83].) We
then show how to modify this technique to obtain an algorithm
which requires the same order of work per iteration as these
algorithms applied to the same problem without allowing
changes to $x_i$. That is, if the $\delta$'s are held fixed at zero,
ODR reduces to OLS and trust region methods require $O(np^2)$
operations per iteration. Our algorithm, by exploiting the
structure of (ODR), still requires only $O(np^2) + O(nm)$
operations per iteration to solve the problem.

While  we  have  designed  and  implemented  the  algo¬ 
rithm  to  handle  the  full  generality  of  (2.1),  the  notation 
is  considerably  simplified  by  assuming  x,  €  S1.  We  tem¬ 
porarily  make  this  assumption  and  rewrite  (2.1)  into  the 
form  of  an  OLS  problem  by  the  following  device.  Let 


“5 


(2.2) 


0)-yi),  i=l,. . .  ,n 
„  i=n+l,  2n. 


J  €  Rn*p  :  J„  = 


dgt(/M)  _  w,df(x,  +  6,;/?) 


Also  let  G  :  Rp+n  — *  J72"  have  component  functions  g,(r;) 
where  t?  =  (f )  •  Now  (ODR)  becomes 

2n 

(2.3)  min||G(>?)||2  =  min  ^  6))J 

"  w  .=i 

which  is  an  OLS  problem  with  (p  +  n)  parameters  and 
2n  equations.  (In  all  cases  in  this  paper,  ||-||  denotes  the 
f 2  vector  or  matrix  norm.)  Direct  application  of  trust 
region  methods  to  (2.3)  would  require  0(2n(n  +  p)2)  op¬ 
erations  per  iteration  which  rapidly  becomes  prohibitive 
if  n  is  large.  (Recall  that  n  is  usually  far  greater  than  p 
in  practice.) 

The  basic  idea  of  a  trust  region  strategy  is  to  choose 
as  the  step  that  vector  which  minimizes  a  linear  approx¬ 
imation  to  G  over  a  region  in  which  the  linearization 
is  a  “reasonable"  approximation  to  G  .  Specifically,  if 
G'(r)c)  €  fl2"*(n+p)  is  the  Jacobian  matrix  of  G  evalu¬ 
ated  at  the  current  iterate,  r?c,  then  the  step  z  is  chosen 
by  solving 

min||G(»/c)  +  GV)*ll2 

(2.4) 

subject  to||Zz||  <  r 

where  Z  is  a  nonsingular  (usually  diagonal)  scaling  matrix 
and  t  is  the  trust  region  radius.  It  is  easy  to  show  that 
the  solution  to  (2.4)  is  given  by  the  z(a)  satisfying 

(2.5)  (G'(vc)TG'{ric)  +  <*ZTZ)  z(o)  =  - G' (r,')T G(r,c ) 

where  a  >  0  is  the  Lagrange  multiplier  for  the  inequal¬ 
ity  constraint.  Note  that  if  ||z(0)||  <  r,  o  =  0  and  the 
constraint  is  inactive.  Otherwise  a  >  0  and  the  con¬ 
straint  is  active.  Equation  (2.5)  is  the  famous  Levenberg- 
Marquardt  formula,  but  this  derivation  has  given  rise 
to  more  stable  and  robust  implementations.  (See,  e.g., 
[Mor77]  and  jDenS83]).  Clearly  (2.5)  can  be  regarded 
as  the  “normal  equations”  for  the  extended  least  squares 
problem, 


C'(nc) 

cM'Z 


where“=2"  means  “equal  in  the  least  squares  sense.” 

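Our implementation is based on the careful exploitation
of the structure of the extended Jacobian matrix in
(2.6). From (2.2) we have that

$G'(\eta) = \begin{bmatrix} J & V \\ 0 & D \end{bmatrix}$

where

$J \in \mathbb{R}^{n \times p}: \; J_{ij} = \dfrac{\partial g_i(\beta,\delta)}{\partial \beta_j} = w_i \dfrac{\partial f(x_i + \delta_i;\beta)}{\partial \beta_j}, \quad i = 1,\dots,n, \; j = 1,\dots,p;$

$V \in \mathbb{R}^{n \times n}: \; V_{ij} = \dfrac{\partial g_i(\beta,\delta)}{\partial \delta_j} = w_i \dfrac{\partial f(x_i + \delta_i;\beta)}{\partial \delta_j}, \quad i = 1,\dots,n, \; j = 1,\dots,n;$

$D \in \mathbb{R}^{n \times n}: \; D = \mathrm{diag}\{w_i d_i,\; i = 1,\dots,n\}.$

Here, we have omitted the arguments of $J$ and $V$ for the
sake of clarity. Observe that since $g_i$ only depends on
$\delta_i$, $i = 1,\dots,n$,

$V = \mathrm{diag}\left\{ \dfrac{\partial g_i(\beta,\delta)}{\partial \delta_i},\; i = 1,\dots,n \right\}.$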

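Commensurate with this partitioning of $G'(\eta_c)$, $\eta_c$ is
naturally partitioned into components $(\beta_c, \delta_c)^T$ and the
step $z$ into a step in $\beta$, say $s$, and a step in $\delta$, say $t$.
Furthermore, we allow for $s$ to be scaled by a nonsingular
diagonal scaling matrix $S$ and $t$ by a nonsingular diagonal
matrix $T$. Thus (2.6) becomes

(2.7)  $\begin{bmatrix} J & V \\ 0 & D \\ \alpha^{1/2} S & 0 \\ 0 & \alpha^{1/2} T \end{bmatrix} \begin{bmatrix} s \\ t \end{bmatrix} =_2 -\begin{bmatrix} G_1 \\ G_2 \\ 0 \\ 0 \end{bmatrix}$

where $G_1$ is the first $n$ components of $G$ and $G_2$ is the
last $n$ components.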

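Now, if $x_i \in \mathbb{R}^m$, then (2.7) will have the same form
except that $V \in \mathbb{R}^{n \times nm}$; $T, D \in \mathbb{R}^{nm \times nm}$ are still
diagonal; and $V$, instead of being diagonal, has the "staircase"
structure which is illustrated for $n = 4$ and $m = 3$ as
follows:

$V = \begin{bmatrix} \times\times\times & & & \\ & \times\times\times & & \\ & & \times\times\times & \\ & & & \times\times\times \end{bmatrix}$

The rest of the development now allows $x_i \in \mathbb{R}^m$.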

Boggs, Byrd, and Schnabel [BogBS85] derive in detail
an efficient procedure for solving (2.7). Here we just give
this procedure and summarize its derivation.

By forming the normal equations for (2.7), it is
straightforward to show that the $s$ that solves (2.7) is
the solution to

(2.8)  $\begin{bmatrix} \tilde{J} \\ \alpha^{1/2} S \end{bmatrix} s =_2 \begin{bmatrix} \tilde{y} \\ 0 \end{bmatrix}$

where

(2.9)  $\tilde{J} = (I - V P^{-1} V^T)^{1/2} J$

(2.10)  $\tilde{y} = (I - V P^{-1} V^T)^{-1/2} \left[ -G_1 + V P^{-1} (V^T G_1 + D G_2) \right]$

with $P$ defined by $P = V^T V + D^2 + \alpha T^2$. The same
derivation shows that given $s$, the $t$ that solves (2.7) is
given by

(2.11)  $t = -P^{-1} (V^T G_1 + D G_2 + V^T J s).$

Boggs, Byrd, and Schnabel then show that (2.9), (2.10),
and (2.11) are equivalent to

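(2.12)  $\tilde{J} = \mathrm{diag}\left\{ \dfrac{1}{[1 + \omega_i]^{1/2}},\; i = 1,\dots,n \right\} J$

(2.13)  $\tilde{y} = -\mathrm{diag}\left\{ \dfrac{1}{[1 + \omega_i]^{1/2}},\; i = 1,\dots,n \right\} (G_1 - V E^{-1} D G_2)$

(2.14)  $t = -E^{-1} \left[ V^T \mathrm{diag}\left\{ \dfrac{1}{1 + \omega_i},\; i = 1,\dots,n \right\} (G_1 + J s - V E^{-1} D G_2) + D G_2 \right]$

where $E$ is defined by $E = D^2 + \alpha T^2$ and

$\omega_i = \sum_{j=1}^{m} \dfrac{V_{ik}^2}{E_{kk}}, \quad k = (i-1)m + j, \quad i = 1,\dots,n.$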
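One way to see why the square roots of $I - V P^{-1} V^T$ collapse to diagonal scalings is sketched below. It uses only the Sherman-Morrison-Woodbury identity and the staircase structure of $V$, and writes $\omega_i$ for the $i$-th diagonal entry of $V E^{-1} V^T$ with $E = D^2 + \alpha T^2$; it is a summary consistent with the formulas above, not a reproduction of the derivation in [BogBS85].

```latex
\begin{aligned}
P &= V^T V + E, \qquad E = D^2 + \alpha T^2 \ \text{(diagonal)},\\
I - V P^{-1} V^T &= \left( I + V E^{-1} V^T \right)^{-1}
  = \operatorname{diag}\left\{ \frac{1}{1+\omega_i},\ i = 1,\dots,n \right\},\\
V P^{-1} &= \left( I + V E^{-1} V^T \right)^{-1} V E^{-1}.
\end{aligned}
```

Because the rows of $V$ have disjoint supports, $V E^{-1} V^T$ is an $n \times n$ diagonal matrix, so every matrix function of it can be applied entry by entry.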

Equations (2.12)-(2.13) show that the system of equations
(2.8) can be formed in $O(np + nm)$ operations. The
solution of (2.8) then involves a QR decomposition of $\tilde{J}$
(accomplished by Householder transformations with column
pivoting) and then a sequence of plane rotations to
eliminate $\alpha^{1/2} S$. The cost of this phase is dominated
by the $O(np^2)$ operations for the QR decomposition of
$\tilde{J}$. It is then easily verified that the cost of calculating $t$
from (2.14) is dominated by the $O(np)$ operations
needed to form $Js$ and several $O(nm)$ terms.

Thus the leading cost of calculating a step for ODR is
the same $O(np^2)$ operations needed to do the factorization
of an $n \times p$ matrix as in OLS. The only additional costs are
a small number of calculations costing $O(nm)$ or $O(np)$
operations.
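The structured formulas can be checked numerically in the simplest setting. The sketch below assumes scalar $x_i$ (so $m = 1$ and $V$, $D$, $T$ are all diagonal) and $p = 2$, and it solves the small system for $s$ through its 2-by-2 normal equations rather than through the QR factorization described above; that substitution is made only for brevity. All function names and the test data are hypothetical.

```python
def structured_step(J, v, d, G1, G2, s_scale, t_scale, alpha):
    """Compute (s, t) from (2.12)-(2.14) for m = 1 and p = 2.

    J is n-by-2; v, d, t_scale hold the diagonals of V, D, T;
    s_scale holds the diagonal of S.
    """
    n = len(v)
    E = [d[i] ** 2 + alpha * t_scale[i] ** 2 for i in range(n)]
    omega = [v[i] ** 2 / E[i] for i in range(n)]
    # h = G1 - V E^{-1} D G2
    h = [G1[i] - v[i] * d[i] * G2[i] / E[i] for i in range(n)]
    w = [1.0 / (1.0 + omega[i]) for i in range(n)]
    # Normal equations of (2.8): (J~^T J~ + alpha S^2) s = J~^T y~
    a11 = sum(w[i] * J[i][0] ** 2 for i in range(n)) + alpha * s_scale[0] ** 2
    a12 = sum(w[i] * J[i][0] * J[i][1] for i in range(n))
    a22 = sum(w[i] * J[i][1] ** 2 for i in range(n)) + alpha * s_scale[1] ** 2
    b1 = -sum(w[i] * J[i][0] * h[i] for i in range(n))
    b2 = -sum(w[i] * J[i][1] * h[i] for i in range(n))
    det = a11 * a22 - a12 * a12
    s = [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]
    Js = [J[i][0] * s[0] + J[i][1] * s[1] for i in range(n)]
    # Equation (2.14), entry by entry.
    t = [-(v[i] * w[i] * (h[i] + Js[i]) + d[i] * G2[i]) / E[i] for i in range(n)]
    return s, t


def normal_eq_residuals(J, v, d, G1, G2, s_scale, t_scale, alpha, s, t):
    """Residuals of the normal equations of the full system (2.7)."""
    n = len(v)
    r1 = [J[i][0] * s[0] + J[i][1] * s[1] + v[i] * t[i] + G1[i] for i in range(n)]
    r2 = [d[i] * t[i] + G2[i] for i in range(n)]
    res_s = [sum(J[i][k] * r1[i] for i in range(n)) + alpha * s_scale[k] ** 2 * s[k]
             for k in range(2)]
    res_t = [v[i] * r1[i] + d[i] * r2[i] + alpha * t_scale[i] ** 2 * t[i]
             for i in range(n)]
    return res_s, res_t


# Made-up test data.
J = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]]
v = [0.3, -0.2, 0.5]
d = [1.0, 1.0, 1.0]
G1 = [0.2, -0.1, 0.4]
G2 = [0.05, -0.02, 0.01]
S, T, alpha = [1.0, 1.0], [10.0, 10.0, 10.0], 0.5
s, t = structured_step(J, v, d, G1, G2, S, T, alpha)
res_s, res_t = normal_eq_residuals(J, v, d, G1, G2, S, T, alpha, s, t)
print(s, t)
```

The residuals of the normal equations of (2.7) vanish to rounding error, while the work per step is a handful of length-$n$ loops plus one small solve.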

It may occur to the reader that an efficient QR factorization
of the matrix in (2.7) might yield a procedure with
the same order of work. By re-ordering the upper $2 \times 2$
blocks, one can, indeed, do the factorization of this part
in $O(np^2)$ operations. The subsequent elimination of the
$\alpha^{1/2} S$ and $\alpha^{1/2} T$ blocks, however, would require $O((nm + p)^2)$
operations for each $\alpha$. It is for this reason, as well as others,
that Schwetlick and Tiller [SchT85] do only a "partial"
trust region strategy, i.e., their trust region only applies
to the step in the $\beta$ variables. In some badly scaled problems,
however (e.g., Example 3 in Section 4), the ability
to scale and constrain the step in $\delta$ is essential to solve
the problem.

The above formulas for $s$ and $t$ are used for each $\alpha$
value in (2.5). Thus in order to complete the specification
of the algorithm, we need to provide the procedure for
computing the trust region parameter $\alpha$ to satisfy (2.4)
and for adjusting the trust region radius $\tau$. These details
are discussed in [BogBS85].


Since many users will want to compare the results of
OLS with ODR, our code includes an option to do OLS.
Enabling this option merely initializes the $\delta$ vector to zero
and sets $V$ to zero whenever it is computed. It is easily
verified that, in this case, each step reduces to the OLS
Levenberg-Marquardt step and yields $t = 0$, leaving $\delta = 0$.
Using this procedure to do OLS, therefore, is equivalent
to a standard OLS algorithm with a moderate extra algebraic
overhead.
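The reduction can be checked by substitution into the structured step formulas. The lines below are a short sketch of that check, assuming (as the definition of $D$ suggests) that the last block of $G$ is $G_2 = D\delta$, so that $\delta = 0$ gives $G_2 = 0$.

```latex
\begin{aligned}
V = 0 \ &\Longrightarrow\ \omega_i = 0,\quad \tilde{J} = J,\quad \tilde{y} = -G_1,\\
\text{so } s \text{ solves}\quad &\left( J^T J + \alpha S^2 \right) s = -J^T G_1
  \quad \text{(the OLS Levenberg-Marquardt step)},\\
t &= -E^{-1}\left[\, 0 + D G_2 \,\right] = 0 \quad \text{when } G_2 = D\delta = 0 .
\end{aligned}
```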

3.  Local  and  Global  Convergence  Analysis 

Trust-region-Levenberg-Marquardt methods applied to
the general nonlinear least squares problem have well
known convergence properties (see e.g., [Pow75], [Mor77],
[MorS81], [ShuSB85]). As long as the sequence of Jacobian
matrices, $\{G'(\eta_k)\}$, is uniformly bounded, then

$\lim_{k \to \infty} G'(\eta_k)^T G(\eta_k) = 0,$

so that any cluster point satisfies the first order necessary
conditions for a local minimizer. These results apply to
our algorithm and nothing more needs to be said regarding
global convergence.

The local convergence behavior of general trust-region-
Levenberg-Marquardt methods for nonlinear least squares
is discussed by Byrd and Schnabel [ByrS86] who show
that, if there is a cluster point $\eta_*$ where $G'(\eta_*)$ is nonsingular,
then the iterates converge at least linearly to $\eta_*$
independent of the size of $G(\eta_*)$. This theory also applies
to our algorithms. If, in addition, the residual $G(\eta_*)$ is
sufficiently small, Byrd and Schnabel show that asymptotically
the trust region constraint becomes inactive, and
that the Levenberg-Marquardt algorithm reduces to the
Gauss-Newton iteration

$\eta_{k+1} = \eta_k - \left[ G'(\eta_k)^T G'(\eta_k) \right]^{-1} G'(\eta_k)^T G(\eta_k)$

and is linearly convergent to $\eta_*$. The linear convergence
analysis of the Gauss-Newton method is well known (see
e.g., [OrtR70], [DenS83]). The constant of linear convergence
depends upon the smallest singular value of $G'(\eta_*)$,
the residual $G(\eta_*)$, and the nonlinearity of $G(\eta)$ near $\eta_*$.
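The Gauss-Newton iteration itself is easy to demonstrate on a toy problem. The sketch below is not the ODR algorithm: it fits a hypothetical one-parameter model $y \approx e^{\beta x}$ to made-up zero-residual data, the case in which the residual term in the convergence constant vanishes and convergence is fast.

```python
import math


def gauss_newton(x, y, beta0, iters=20):
    """Gauss-Newton iteration for the one-parameter model y ~ exp(beta * x)."""
    beta = beta0
    history = [beta]
    for _ in range(iters):
        r = [math.exp(beta * xi) - yi for xi, yi in zip(x, y)]   # residual vector
        j = [xi * math.exp(beta * xi) for xi in x]               # Jacobian column
        # beta <- beta - (J^T J)^{-1} J^T r, here a scalar division.
        beta -= sum(ji * ri for ji, ri in zip(j, r)) / sum(ji * ji for ji in j)
        history.append(beta)
    return beta, history


x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [math.exp(0.5 * xi) for xi in x]      # exact data, true beta = 0.5
beta, history = gauss_newton(x, y, beta0=0.3)
print(beta)
print([abs(b - 0.5) for b in history[:5]])
```

Adding noise to `y` makes the residual at the solution nonzero, and the error reduction per iteration then settles to a roughly constant factor, which is the linear convergence described in the text.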

The small residual analysis is particularly relevant to
ODR because most applications of ODR will have small
residuals. This is especially true when ODR is used to
consider errors in independent variables in parameter estimation,
because errors in the independent variables are
most likely to be considered when the model and the dependent
variable measurements are accurate, which implies
that the residuals will be small.

It turns out that the application of the local Gauss-Newton
analysis to ODR is nontrivial, although the expected
results can be proven. To simplify the algebra
here, we consider a version of the ODR problem (2.1)
with the simplified weighting scheme $w_i = 1$ and $d_i = \sigma$
for all $i$, i.e.,

(3.1)  $\min_{\beta,\delta} \sum_{i=1}^{n} \left[ \bigl( f(x_i + \delta_i;\beta) - y_i \bigr)^2 + \sigma^2 \delta_i^T \delta_i \right]$

where $\sigma > 0$. This weighting still allows the metric of distance
from the curve $f(x;\beta)$ to the data points $(x_i, y_i)$ to
vary from vertical (as $\sigma \to \infty$) to orthogonal ($\sigma = 1$) to
horizontal (as $\sigma \to 0$). (We explain this statement more
carefully later in this section.) This is all the generality
in the weighting that is usually used in practice, and precisely
what we use in most of our computational results
in Section 4.
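The role of $\sigma$ can be made concrete for a straight line through the origin, $f(x;\beta) = \beta x$ with scalar $x$. This model is chosen purely for illustration because the inner minimization over $\delta$ then has a closed form; the function name is hypothetical.

```python
def closest_delta(beta, x, y, sigma):
    """Minimize (beta*(x + delta) - y)^2 + sigma^2 * delta^2 over delta.

    Returns the minimizing delta and the minimum value for the line y = beta*x.
    """
    r = beta * x - y                                   # vertical residual at delta = 0
    delta = -beta * r / (beta ** 2 + sigma ** 2)
    value = sigma ** 2 * r ** 2 / (beta ** 2 + sigma ** 2)
    return delta, value


beta, x, y = 2.0, 1.0, 0.0                             # point (1, 0), line y = 2x
for sigma in (1e-6, 1.0, 1e6):
    print(sigma, closest_delta(beta, x, y, sigma))
```

For $\sigma = 1$ the minimum value $r^2/(1+\beta^2)$ is the squared orthogonal distance from the point to the line. For very large $\sigma$ it approaches the squared vertical residual $r^2$, and for very small $\sigma$ the shift $\delta$ approaches $-r/\beta$, the horizontal move that places the point on the line.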

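To further simplify notation, we rewrite (3.1) as

(3.2)  $\min_{\eta} R(\eta)^T R(\eta) + \sigma^2 \delta^T \delta$

or equivalently,

$\min_{\eta} G(\eta)^T G(\eta)$

where $\delta = (\delta_1^T, \dots, \delta_n^T)^T$, $\eta = (\beta^T, \delta^T)^T$, $R(\eta)_i =
f(x_i + \delta_i;\beta) - y_i$, $i = 1,\dots,n$, and $G(\eta) = (R(\eta)^T, \sigma\delta^T)^T$.
Our analysis will not depend upon the special form of
$R(\eta)$ in any way. Recall that

$G'(\eta) = \begin{bmatrix} J(\eta) & V(\eta) \\ 0 & \sigma I \end{bmatrix}$

where $J(\eta)$ and $V(\eta)$ are, as in Section 2, the derivatives
of $R(\eta)$ with respect to $\beta$ and $\delta$ respectively.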

The difficulty in applying standard Gauss-Newton
analysis to (3.2) is that $G(\eta)$ and $G'(\eta)$ are functions of $\sigma$.
In Theorem 3.1 we show that the convergence can be analyzed
in terms of the properties of $J(\eta)$, $V(\eta)$, $R(\eta_*)$, and
$\delta_*$ only, i.e., independent of $\sigma$ except for its role in determining
$\eta_*$. For the proof of Theorem 3.1, see [BogBS85].
In the statement of Theorem 3.1, we often omit the
argument $\eta$; i.e., we denote $G(\eta_*)$ and $G(\eta_0)$ by $G_*$ and
$G_0$ respectively, and likewise for other symbols in place
of $G$. Also for $J$ having full column rank, $J^+$ denotes
$(J^T J)^{-1} J^T$, and for $V$ having full row rank, $V^+$ denotes
$V^T (V V^T)^{-1}$. Note that $\|J^+\|^2 = \|(J^T J)^{-1}\|$.

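Theorem 3.1. Let $R(\eta): \mathbb{R}^{p+q} \to \mathbb{R}^n$ be continuously
differentiable in an open convex set $D \subset \mathbb{R}^{p+q}$. Let $\eta^T =
(\beta^T, \delta^T)$, $\beta \in \mathbb{R}^p$, $\delta \in \mathbb{R}^q$, let $\sigma$ be a positive scalar, and
let $G(\eta) = \begin{pmatrix} R(\eta) \\ \sigma\delta \end{pmatrix}$. Assume there exists $\eta_* \in D$ such
that $G'(\eta_*)^T G(\eta_*) = 0$, and that there exists $\gamma > 0$ for
which

$\|R'(\eta) - R'(\eta_*)\| \le \gamma \|\eta - \eta_*\|$

for all $\eta \in D$. Define

$c_1 = \gamma \left[ \|(J_*^T J_*)^{-1}\| \, \|R_*\| + \left( 1 + \|(J_*^T J_*)^{-1}\| \, \|V_*\|^2 \right) \|V_*^+\| \, \|\delta_*\| \right]$

$c_2 = (\gamma/2) \left[ \|V_*^+\| + \|J_*^+\| \left( 1 + \|V_*\| \, \|V_*^+\| \right) \right].$

If $c_1 < 1$, then for any $c \in (1, 1/c_1)$, there exists $\epsilon > 0$
such that for all $\eta_0$ for which $\|\eta_0 - \eta_*\| < \epsilon$, the sequence
generated by the Gauss-Newton method

$\eta_{k+1} = \eta_k - \left( G'(\eta_k)^T G'(\eta_k) \right)^{-1} G'(\eta_k)^T G(\eta_k)$

is well defined, converges to $\eta_*$ and obeys

$\|\eta_{k+1} - \eta_*\| \le c \left( c_1 + c_2 \|\eta_k - \eta_*\| \right) \|\eta_k - \eta_*\|.$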

In practical ODR applications, the user may wish to
solve (3.1) for various values of $\sigma$. Now we consider the
behavior of the ODR problem (3.2) as the parameter $\sigma$
is varied. For this purpose, let us denote the global minimizer
to (3.2) by $\eta_*(\sigma)$. Then by standard analyses of
barrier function methods (see e.g., [FiaM68] or [Lue73])
we know that the limit of $\eta_*(\sigma)$ as $\sigma \to \infty$ is the solution
to

$\min_{\eta} \|R(\eta)\|^2$ subject to $\delta = 0$,

i.e., the standard OLS problem

(3.3)  $\min_{\beta} \|R(\beta, 0)\|^2.$

Similarly, the limit of $\eta_*(\sigma)$ as $\sigma \to 0$ is the solution to
the implicit least squares (ILS) problem

(3.4)  $\min_{\eta} \|\delta\|^2$ subject to $R(\eta) = 0$.

In the data fitting context where $R(\eta)_i = f(x_i + \delta_i;\beta) - y_i$,
(3.3) is the standard problem where the independent variables
$x_i$ are assumed exact so that the metric of distance
is in the $y$ (vertical) direction only. In contrast, (3.4) is
the case where the dependent variables $y_i$ are assumed
exact and the independent variables $x_i$ inexact, so that
the metric is entirely in the $x$ (horizontal) direction.

The standard analysis of barrier function methods also
shows that $\|R(\eta_*(\sigma))\|$ is a monotonically increasing function
of $\sigma$, and that $\|\delta_*(\sigma)\|$ is a monotonically decreasing
function of $\sigma$. This means that for all $\sigma \in (0, \infty)$, the values
of $\|R(\eta_*(\sigma))\|$ and $\|\delta_*(\sigma)\|$ are bounded above by the
optimal objective function values for problems (3.3) and
(3.4), respectively. In data fitting terms, for any $\sigma$, the
norm of the optimal vertical residuals in ODR is bounded
above by the norm of the optimal residuals in OLS, and
the norm of the optimal horizontal residuals in ODR is
bounded above by the norm of the optimal residual for
the ILS problem. The computational results of Section 4
demonstrate these relationships.
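These monotonicity relations can be observed on a model simple enough to solve in closed form. The sketch below again assumes the illustrative line $f(x;\beta) = \beta x$ with scalar $x$ and made-up data; eliminating $\delta$ leaves a one-dimensional problem in $\beta$ whose global minimizer is a root of a quadratic. The function name and data are hypothetical.

```python
import math


def odr_line(x, y, sigma):
    """Global minimizer of sum_i (beta*(x_i+d_i) - y_i)^2 + sigma^2 d_i^2 for y = beta*x.

    Assumes sum(x_i * y_i) > 0, in which case the minimizer is the larger
    root of b*beta^2 + (a*sigma^2 - c)*beta - b*sigma^2 = 0.
    Returns beta, ||R(eta*)|| and ||delta*||.
    """
    a = sum(xi * xi for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y))
    c = sum(yi * yi for yi in y)
    q = a * sigma ** 2 - c
    beta = (-q + math.sqrt(q * q + 4.0 * b * b * sigma ** 2)) / (2.0 * b)
    w = beta ** 2 + sigma ** 2
    r = [beta * xi - yi for xi, yi in zip(x, y)]
    norm_R = math.sqrt(sum((sigma ** 2 * ri / w) ** 2 for ri in r))
    norm_delta = math.sqrt(sum((beta * ri / w) ** 2 for ri in r))
    return beta, norm_R, norm_delta


x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]                 # made-up data near the line y = x
results = [odr_line(x, y, s) for s in (0.1, 1.0, 10.0, 100.0, 1000.0)]
for row in results:
    print(row)
```

The second column (vertical residual norm) increases with $\sigma$ and the third (norm of $\delta$) decreases, while $\beta$ drifts toward the OLS slope $\sum x_i y_i / \sum x_i^2$.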

Combining the above facts with Theorem 3.1 shows
that, if the optimal objective function values for problems
(3.3) and (3.4) are sufficiently small, and if $J(\eta_*(\sigma))$
and $V(\eta_*(\sigma))$ are sufficiently well-conditioned for all $\sigma \in
(0, \infty)$, then the Gauss-Newton algorithm applied to (3.2)
is linearly convergent for any $\sigma \in (0, \infty)$.

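Corollary 3.2. Let $\eta$, $\beta$, $\delta$, $R(\eta)$, $G(\eta)$, $J(\eta)$, and $V(\eta)$
be defined as in Theorem 3.1. For any $\sigma \in (0, \infty)$, let
$\eta_*(\sigma) = (\beta_*(\sigma)^T, \delta_*(\sigma)^T)^T$ denote the global solution to

$\min_{\beta,\delta} \|R(\beta, \delta)\|^2 + \sigma^2 \|\delta\|^2.$

Also let $\beta_{ols}$ denote the global solution to the ordinary
least squares problem

$\min_{\beta} \|R(\beta, 0)\|^2$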




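and let $(\beta_{ils}, \delta_{ils})$ denote the global solution to the
implicit least squares problem

$\min_{\beta,\delta} \|\delta\|^2$ subject to $R(\beta, \delta) = 0$.

Let $R_{ols} = R(\beta_{ols}, 0)$. Assume that there exist $\gamma >
0$, $\bar{\epsilon} > 0$ such that for each $\sigma \in (0, \infty)$,

$\|R'(\eta) - R'(\eta_*(\sigma))\| \le \gamma \|\eta - \eta_*(\sigma)\|$

for all $\eta$ for which $\|\eta - \eta_*(\sigma)\| < \bar{\epsilon}$. Assume also that for
all $\sigma \in (0, \infty)$, $J(\eta_*(\sigma))$ and $V(\eta_*(\sigma))$ have full column
and row rank, respectively, and let $\bar{J}$, $\bar{J}^+$, $\bar{V}$, and $\bar{V}^+$ be
uniform bounds on the norms $\|J(\eta_*(\sigma))\|$, $\|J(\eta_*(\sigma))^+\|$,
$\|V(\eta_*(\sigma))\|$, and $\|V(\eta_*(\sigma))^+\|$, respectively, over all $\sigma \in
(0, \infty)$. Define

$c_1 = \gamma \left( (\bar{J}^+)^2 \|R_{ols}\| + \left( 1 + (\bar{J}^+)^2 \bar{V}^2 \right) \bar{V}^+ \|\delta_{ils}\| \right)$

$c_2 = (\gamma/2) \left[ \bar{V}^+ + \bar{J}^+ \left( 1 + \bar{V} \bar{V}^+ \right) \right].$

If $c_1 < 1$, then for any $c \in (1, 1/c_1)$, there exists $\epsilon > 0$
such that for any $\sigma \in (0, \infty)$, the sequence $\{\eta_k\}$ generated
by the Gauss-Newton method applied to (3.2) starting
from any $\eta_0$ for which $\|\eta_0 - \eta_*(\sigma)\| < \epsilon$ is well-defined,
converges to $\eta_*(\sigma)$, and obeys

$\|\eta_{k+1} - \eta_*(\sigma)\| \le c \left[ c_1 + c_2 \|\eta_k - \eta_*(\sigma)\| \right] \|\eta_k - \eta_*(\sigma)\|.$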

4.  Computational  Testing. 

In this section we report the results of preliminary computational
testing. These tests, consisting of two contrived
problems and one real problem, were selected in
order to illustrate the effectiveness of the implementation
and to demonstrate the performance of the basic algorithm.
They also allow us to contrast ODR and OLS,
which can have rather dramatic differences, and to point
out some of the inherent difficulties in ODR problems.

The contrast between OLS and ODR is best brought
out in terms of the parameter $\sigma$ and the function $\beta(\sigma)$
from Section 3. (Recall that $\beta(\infty)$ corresponds to the
OLS solution.) Since, in practice, the correct value of $\sigma$
may not be known exactly, it is of interest to compute
$\beta(\sigma)$ for various values of $\sigma$.

The algorithm was coded in Fortran 77 and run in double
precision on the Perkin-Elmer 3230 at the National
Bureau of Standards (NBS). Graphs of the fitting functions
for all three examples are given in [BogBS85].

Example 1. Consider

$f(x) = \dfrac{1}{x - 1}$

and define $x_i = .01 + (i - 1) \cdot .05$, $i = 1, \dots, 40$. Next let

$y_i = f(x_i), \quad i = 1, \dots, 40.$

Now we perturb the data points as follows:

$x_i := x_i + r_x$
$y_i := y_i + r_y$

where the $r_x$ are uniformly distributed on $(-.05, .05)$
and the $r_y$ are uniformly distributed on $(-.25, .25)$. The
model for the data was taken to be

$y = \dfrac{\beta_1}{x - \beta_2}$

and the ODR program was run with several values of
$\sigma$. The results are reported in two tables. Table 1 was
generated by setting $\sigma = 1$ and taking $\beta^0 = (1, 1)^T$. Subsequent
solutions for higher values of $\sigma$ used the previous
solution for the initial approximation. In addition to the
values of $\beta(\sigma)$, Table 1 contains the number of evaluations
of the extended residual function $G$ (cf. (2.3)) and its Jacobian,
and the optimal values of $\|R(\eta(\sigma))\|$ and $\|\delta(\sigma)\|$
for each value of $\sigma$. Since the value of $\delta$ was expected to
be approximately the size of the variance of the errors,
we set the weight $T = 10$. Table 2 is organized just as
Table 1, but the results were generated by starting with
the OLS solution using $\beta^0 = (1, 1)^T$ and then decreasing
$\sigma$.
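A data set of this kind can be regenerated along the following lines. This is a sketch under assumptions: the generating function is taken to be $1/(x - 1)$, which is consistent with the data point $(1.01, 100)$ mentioned in the discussion of this example, and the random seed and function name are arbitrary, so the numbers will not match the data actually used for Tables 1 and 2.

```python
import random


def make_example1_data(seed=0):
    """Generate 40 perturbed points in the style of Example 1."""
    rng = random.Random(seed)                              # seed is arbitrary
    # Index i = 0 here corresponds to i = 1 in the text.
    x_true = [0.01 + i * 0.05 for i in range(40)]
    y_true = [1.0 / (xi - 1.0) for xi in x_true]           # assumed f(x) = 1/(x - 1)
    x_obs = [xi + rng.uniform(-0.05, 0.05) for xi in x_true]
    y_obs = [yi + rng.uniform(-0.25, 0.25) for yi in y_true]
    return x_true, y_true, x_obs, y_obs


x_true, y_true, x_obs, y_obs = make_example1_data()
print(x_true[20], y_true[20])    # the point next to the asymptote at x = 1
```

The twenty-first point sits at $x = 1.01$, immediately to the right of the asymptote, where the unperturbed response is about 100.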

Obviously, Tables 1 and 2 exhibit a nonuniqueness of
the solutions. It appears that there are two local solutions
for the OLS problem corresponding to the asymptote to
$+\infty$ being on the left or right half of the curve, and that
the trajectories emanating from these solutions come together
around $\sigma = 600$ or that the trajectory represented
in Table 2 fails to be continuous near $\sigma = 600$. A possible
means of investigating this phenomenon is to write the
differential equation describing the trajectory $\beta(\sigma)$ and
to study possible bifurcation points. This is not pursued
here.

Observe that $\beta_2$ determines the location of the asymptote
and thus the data locate this parameter very well.
The graph of the OLS fit, however, shows that the data
point near the asymptote, corresponding to $(1.01, 100)^T$,
completely dominates the fitting process for OLS in Table 2
and results in a value of $-.3180$ for $\beta_1$. The ODR
fit is not nearly so influenced by this data point and, for
a broad range of $\sigma$, does a very good job of fitting the
data. This last point is important, namely that the parameter
values do not vary much as a function of $\sigma$, which
means that $\sigma$ may not need to be known with much accuracy.
The stability of $\beta(\sigma)$ has been noticed on all of
our examples and on problems not reported here. This is
not, of course, a proof that this phenomenon holds more
generally.

A further difference between the OLS and the ODR
fits is that the errors for both of the OLS fits do not
appear to be random. The graph of the OLS fit shows
that almost all of the errors to the left of the asymptote
are negative while all to the right are positive. The ODR
errors for reasonable values of $\sigma$ appear to be much more
random.

An examination of the computational results reveals
that the only hard optimization problem in each set is
the first. Subsequent solutions are found very quickly except,
of course, for the problem corresponding to $\sigma = 500$
in Table 2 which appears to have jumped across a discontinuity
in $\beta(\sigma)$. A detailed examination of the iteration
process shows that the algorithm sometimes slows down
(a very small value of $\tau$ is generated) but then recovers
and final convergence is with full Gauss-Newton steps.
For the case $\sigma = 500$, fairly large steps in $\delta$ were generated
which led to apparent convergence with very poor $\beta$
values (near $(0, 1)$) and very large values of $\delta$ ($O(1)$). In
this case, a very small value of $\tau$ was produced. When
the procedure was restarted with a large value of $\tau$, the
algorithm immediately stepped over this bad region and
converged quickly to the correct solution. Thus, it appears
to be important to scale the step in $\delta$ correctly and
to be on the lookout for unrealistic solutions.

Example 2. This example is a two dimensional version
of Example 1. Here we take $x \in \mathbb{R}^2$ and

$f(x) = \dfrac{1}{x_1 + x_2 - 1}$

This function has a line of singularities along $x_1 + x_2 = 1$.
We take the data to be on the rectangular grid of width
.1 in the $x_1$ direction and width .2 in the $x_2$ direction.
The first point is $(.01, .01)^T$ and there are 10 points in
the $x_1$ direction and 5 points in the $x_2$ direction. $y$ is then
evaluated at these points and the data are then perturbed
according to the following:

$(x_1)_i := (x_1)_i + r_x$
$(x_2)_i := (x_2)_i + r_x$
$y_i := y_i + r_y$

where the $r_x$ are normally distributed with mean 0 and standard
deviation .01 and the $r_y$ are distributed normally
with mean 0 and standard deviation .04.

The form of the model is

$y = \dfrac{\beta_1}{\beta_2 x_1 + \beta_3 x_2 - 1}$

The results are given in Table 3, which is organized as
Table 1. Again the values of $\beta(\sigma)$ do not vary quickly,
the location of the asymptote is well-determined by the
data, and only $\beta_1$ changes much as $\sigma$ increases. Graphs of
the fitting functions show that the fits depend more and
more on the points near the asymptote as $\sigma$ increases.
Here the insistence on near vertical measures of the error
forces $\beta_1$ to assume smaller values, which has the effect
of flattening the function as much as possible near the
asymptote. This, of course, tends to minimize the vertical
component of the error. As in Example 1, the errors for
the OLS fit do not appear random while those for the
ODR fits do.

Note that the first solution, corresponding to $\sigma = 1$,
was computed with some difficulty. (This is the same
situation as occurred for $\sigma = 500$ in Table 2.) In these
cases, the terrain in parameter space ($\beta$ and $\delta$) appears
rather flat and fairly large values of $\delta$ were again obtained
on intermediate iterations. The iteration stalled with an
indication of convergence due to x-convergence and a very
small value of the trust region radius. A restart (which
resets the trust region radius to a larger value) then allows
the iterates to step over this flat area and converge very
quickly to the correct answer.

The non-uniqueness observed in Example 1 was again
observed here. The details are not reported, but we found
a second OLS solution which led to a trajectory of solutions
that finally joined the above trajectory at $\sigma = 2$.

Example 3. The data here are actual measurements
from a calibration run on an electronic device which was
intended to give a flat response over a wide range of frequencies.
In the $(x, y)$-data, the x-values are in units of
frequency squared and the y-data are the gain. The x-values
are scaled to the interval $(0, 1)$ with several measurements
made in each decade from $10^{-8}$ to 1. More
measurements were taken at the higher frequencies since
most of the important information is obtained there.

The model for this data was obtained from theoretical
considerations and has the form

$y_i \approx \sum_{j=1}^{4} \dfrac{\alpha_j}{x_i + \gamma_j} + \mu, \quad i = 1, \dots, 44$

where the parameters to be determined are

$\beta = (\alpha_1, \dots, \alpha_4, \mu, \gamma_1, \dots, \gamma_4)^T.$

Estimates of the pole locations (the negative $\gamma$-values)
are likewise obtained from other analyses. The $\gamma$-values
are approximately

$\gamma_1 = 1.38 \times 10^{-3}$
$\gamma_2 = 5.96 \times 10^{-2}$
$\gamma_3 = 6.71 \times 10^{1}$
$\gamma_4 = 1.07 \times 10^{9}.$


Since all of the poles are negative and all of the data
have positive x-values, there is no problem with being
close to the asymptotes. The range of the x-values, however,
implies the need to scale the trust region. We used,
for the diagonal scaling matrices $S$ and $T$, the following:

1 


* '  10m 


It turns out that the measurements are proportionately
more accurate at the lower frequencies and we therefore
took the d-weights to be the same as the t-weights.

While the data were measured quite accurately, there
were simply no data at a sufficiently high frequency to
warrant keeping the two terms corresponding to $j = 3$
and $j = 4$ in the model. This situation was evidenced
by the fact that the Jacobian $J$ had five almost identical
columns.




With these terms removed, the resulting problem was
easily solved as follows. Using a feature of the program
which allows certain parameters to be held fixed at specified
values, we fixed the pole values (the $-\gamma$-values) and
used an OLS estimate of the remaining linear parameters.
We then freed all of the parameters and did an OLS fit
and ODR fits with several values of $\sigma$. In doing the
ODR fits, we first specified a $\sigma$-value of .01 since the gain
measurements in this data set were 100 times more accurate
than the frequency measurements. Other values of $\sigma$
were subsequently used for comparison. The results are in
Table 4. Virtually no difference appears between the two
fits at the lower frequencies, but some differences occur at
the higher frequencies. In the enlargements of the fitting
functions, one can easily see that the contribution of the
error in the x-values causes ODR to get a significantly
better fit than OLS. While the $\beta$-values are not reported
here, there were, again, very slow changes in $\beta(\sigma)$.

In this section we have shown that our algorithm is effective
on highly nonlinear problems, but that these problems
themselves often have multiple solutions and other
difficulties which imply that potential solutions need to
be studied carefully. In subsequent papers, we will provide
a more complete description of our implementation
and further results on its performance.
Table 1

                                Evals of        Final Values
   σ     β_1(σ)    β_2(σ)      G     G'    ||R(η(σ))||   ||δ(σ)||
   1     1.023     1.006      70     25       0.223       0.355
   2     1.021     1.005       6      5       0.454       0.223
   5     1.015     1.004       6      4       0.771       0.128
  25     0.9847    1.002       6      5       1.280       0.080
 100     0.9247    0.9972      9      8       3.204       0.081
 300     0.9881    0.9928     13     12      10.408       0.035
 500     0.9487    0.9953     12     11      15.524       0.018
1000     0.8248    0.9937     10      9      18.881       0.007
   ∞     0.6867    0.9909      7      6      21.774       0.

Table 2

                                Evals of        Final Values
   σ     β_1(σ)    β_2(σ)      G     G'    ||R(η(σ))||   ||δ(σ)||
   ∞    -0.3170    1.010      40     22     104.709       0.
1000    -0.3355    1.095      20     15     104.223       0.007
 700    -0.3845    1.093      27     21     103.660       0.015
 500”    0.9487    0.9953    103     43      15.524       0.018

Table 3

                                          Evals of        Final Values
   σ     β_1(σ)    β_2(σ)    β_3(σ)      G     G'    ||R(η(σ))||   ||δ(σ)||
   1”    0.8988    0.9482    1.015     147     60       0.184       0.670
   2     0.9223    0.9478    1.019       7      6       0.428       0.618
   4     0.9345    0.9506    1.027       8      7       0.989       0.540
  10     0.9049    0.9510    1.047       9      8       2.379       0.429
  40     0.7148    0.9568    1.044      10      9       6.411       0.315
 100     0.3645    0.9343    0.9894     22     16      19.934       0.174
 500     0.0914    0.8830    0.9675     25     17      30.424       0.039
   ∞     0.1192    0.8883    0.9338     27     12      77.440       0.

Table 4

            Evals of        Final Values
   σ       G     G'    ||R(η(σ))||   ||δ(σ)||
   ∞      19      9       1.8702      0.
  .01     18      7       0.0005      0.03993
  .1      12      7       0.0016      0.00482
 1.        5      4       0.0018      0.00006
THE APPLICATION OF CONVEX HULLS IN MULTIPLE DIMENSIONS

Convex hulls and minimum covering ellipsoids are
two  possible  methods  for  defining  the  region  of 
space  enclosed  by  given  points.  Algorithms  for 
computing  these  regions  in  multidimensional  space  were 
implemented  and  investigated.  The  minimum  covering 
ellipsoid  implementation  takes  much  less  time  and 
storage  than  the  convex  hull.  The  expected  probability 
content  for  the  convex  hulls  in  higher  dimensions  is 
small. 

KEY  WORDS:  Convex  Hull;  Minimum  Covering 
Ellipsoid 

1.  INTRODUCTION 

The  computation  of  a  convex  region  containing 
a  set  of  points  has  been  extensively  studied  and 
applied  in  two  dimensions.  Both  convex  hulls 
and  minimum  covering  ellipsoids  have  been  used  in 
statistical  applications.  The  convex  hull  of  a  set  of 
points  is  the  smallest  convex  region  containing  all  of  the 
points.  The  minimum  covering  ellipsoid  is  the  ellipsoid 
of  smallest  content  that  contains  the  points.  Figure 
1  shows  the  convex  hull  and  the  minimum  covering 
ellipsoid  for  20  points  in  two  dimensions.  Although 
convex  hulls  and  minimum  content  ellipsoids  have 
applications  in  more  than  two  dimensions,  programs 
for  computing  these  regions  in  multidimensional  cases 
are  not  readily  available.  In  this  study  we  report 
the  implementation  of  programs  to  compute  convex 
hulls  and  minimum  covering  ellipsoids  in  multiple 
dimensions.  The  time  and  space  requirements  of 
the  implementations  are  investigated  empirically  for 
random  points  from  normal  and  uniform  distributions. 


Figure 1. Convex Hull and Minimum Covering
Ellipse for n=20 Points in d=2 Dimensions. The convex
hull is the shaded region.


2. CONVEX HULLS

2.1 Uses

Convex hulls have been used in a number of
statistical contexts. Kendall (1966) suggested that in
a discrimination context a new observation could be
assigned to be in the same class as a set of points if
the new point was in the convex hull of the given class
of points. Kendall also suggested finding the extent to
which groups of points formed distinct classes by finding
how many of the points in other groups were in the
convex hull of a given group. Convex hulls have also
been used in various versions of peeling or trimming
multivariate data. Barnett (1976) discusses using
convex hulls for ordering multivariate data. According
to Huber (1972) the idea was originally suggested by
Tukey. The idea is to take all points on the convex
hull as the analog of the extreme order statistics. By
peeling off successive outer layers of points, one can
define robust estimators of location (Seheult, Diggle
and Evans, 1976) or robust estimators of the correlation
coefficient (Bebbington, 1978). Convex hulls have also
been used in constructing nonparametric estimators of
densities and modes (Eddy and Hartigan, 1977).


2.2  Computation 

The  computation  of  convex  hulls  in  two  dimensions 
is  well  studied,  and  algorithms  are  readily  available. 
For  example,  the  S  software  (Becker  and  Chambers, 
1984)  includes  a  function  to  find  a  two  dimensional  hull. 
Preparata  and  Shamos  (1985)  describe  several  methods 
for  finding  planar  convex  hulls.  Examples  include  the 
quicksort  analog  first  published  by  Eddy  (1977).  In  two 
dimensions  such  algorithms  provide  easily  implemented, 
reasonably  efficient  algorithms.  However  the  complexity 
of  the  problem  escalates  tremendously  when  one  steps 
up  to  the  general  multidimensional  case.  Preparata 
and  Shamos  (1985)  describe  two  alternatives  for  the 
general case, the gift-wrapping method of Chand and
Kapur (1970) and the beneath-beyond method of Kallay
(1981).

The algorithm implemented here is the gift-wrapping
algorithm. The gift-wrapping analogy comes
from  thinking  of  the  way  one  rotates  a  package 
from  face  to  face  by  pivoting  on  edges  of  the 
package.  The  reason  for  our  choice  of  this  algorithm 
is  historical.  The  interest  in  this  project  started  when 
the  Environmental  Protection  Agency  Research  Lab 
in  Duluth,  Minnesota,  wanted  a  convex  hull  program 
to  use  in  discrimination  much  in  the  way  suggested 
by  Kendall  (1966).  The  EPA  lab  had  tracked  down 
a  FORTRAN  program  written  by  Chand  and  Kapur 
at  Lockheed-Georgia.  Unfortunately  the  program  was 
written  in  FORTRAN  66,  and  the  code  was  not  easy 
to  follow  or  generalise.  Rather  than  wallowing  in 
FORTRAN  66  code,  Benson  implemented  the  algorithm 
from  scratch  in  the  programming  language  C.  Benson’s 
implementation  is  pointer  based  and  has  no  upper  limits 
(in  theory)  to  the  number  of  dimensions.  The  programs 
contain  over  1800  lines  of  code  and  have  been  run 
on  SUN  and  VAX  computers  using  UNIX  operating 
systems. 
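
To make the pivoting idea concrete, the following is a minimal two-dimensional sketch of gift wrapping (often called the Jarvis march). It is an illustration only, written in Python rather than C, and it is not Benson's implementation or the Chand and Kapur program; the function names `cross`, `dist2` and `gift_wrap_2d` are hypothetical. In the plane the "faces" are edges, and pivoting on an edge of the package reduces to turning about the current vertex until the next point is touched.

```python
def cross(o, a, b):
    """Twice the signed area of triangle o, a, b (positive if b is left of o->a)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def dist2(a, b):
    """Squared Euclidean distance between two planar points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def gift_wrap_2d(points):
    """Return hull vertices in counter-clockwise order, starting at the
    leftmost point (lowest one among ties)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    start = pts[0]          # the leftmost point is always a hull vertex
    hull = []
    current = start
    while True:
        hull.append(current)
        candidate = pts[0] if pts[0] != current else pts[1]
        for q in pts:
            if q == current:
                continue
            c = cross(current, candidate, q)
            # Pivot to q if it lies to the right of current->candidate, or if
            # it is collinear and farther away (skips points on an edge).
            if c < 0 or (c == 0 and dist2(current, q) > dist2(current, candidate)):
                candidate = q
        current = candidate
        if current == start:
            break
    return hull


square_plus_center = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
vertices = gift_wrap_2d(square_plus_center)
fraction_on_hull = len(vertices) / len(square_plus_center)
```

Each wrap step scans all n points, so the planar version costs O(nh) for h hull vertices; the general-dimension algorithm pivots on (p-2)-dimensional subfaces instead and is far more involved.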

The  primary  drawback  to  the  computations  is  that 
convex  hulls  in  higher  dimensions  become  very  complex. 
Even if all faces are simplicial, the worst case time
complexity of the gift-wrapping technique for n points
in p dimensions is $O(n^{\lfloor p/2 \rfloor + 1}) + O(n^{\lfloor p/2 \rfloor} \log n)$ and
the worst-case number of faces is $O(n^{\lfloor p/2 \rfloor})$ (Preparata
and Shamos, 1985, page 130). Hence both the time
required  to  compute  the  hull  and  the  space  required 
to  store  the  results  grow  quickly  as  the  dimensionality 
increases.  Swart  (1985)  has  suggested  some  possible 
refinements  of  the  Chand  and  Kapur  algorithm.  Swart’s 
suggestions  involve  storing  some  intermediate  results 
rather  than  recalculating  them.  Benson  is  working  on 
other  modifications,  but  none  of  these  modifications 
were  used  here. 


3.3  Computation 

An iterative algorithm is described by Titterington
(1978), and modifications are given in Silvey,
Titterington and Torsney (1978). A noniterative
algorithm for two dimensional ellipsoids is given by
Silverman and Titterington (1980). The ellipsoid is
described by two parameters, the center $\mu$ and the
quadratic form $Q$. Points on the boundary of the
ellipsoid satisfy the equation

$(x - \mu)^T Q (x - \mu) = p$

The parameters are found by an iterative reweighting
scheme. Let $x_1, \ldots, x_n$ be n points in $R^p$. The
algorithm initially gives weight $w_i = 1/n$ to each point and
then computes

$\mu = \sum_{i=1}^{n} w_i x_i$  (3.1)

$Q = \left[ \sum_{i=1}^{n} w_i (x_i - \mu)(x_i - \mu)^T \right]^{-1}$  (3.2)

Next the ellipsoid is checked for each point by defining

$d_i = (x_i - \mu)^T Q (x_i - \mu)$

and $d_{max} = \max(d_1, d_2, \ldots, d_n)$. If $d_{max} < p + \epsilon$, the
search stops. Otherwise new weights are computed
according to $w_i = (d_i / p) w_i$ and these are used to update
$\mu$ and $Q$ according to (3.1) and (3.2). In our simulation
study described below, a value of 0.01 was used for $\epsilon$.
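
The reweighting scheme above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used in the study: the names `solve` and `min_covering_ellipsoid` are hypothetical, the weighted scatter matrix $M = Q^{-1}$ is returned instead of $Q$, and an iteration cap `max_iter` is added as a safeguard that the text does not mention. The points are assumed not to lie in a lower-dimensional subspace, since $M$ would then be singular.

```python
def solve(M, v):
    """Solve M y = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = A[r][n] - sum(A[r][k] * y[k] for k in range(r + 1, n))
        y[r] = s / A[r][r]
    return y


def min_covering_ellipsoid(points, eps=0.01, max_iter=10000):
    """Iterative reweighting following (3.1) and (3.2).

    Returns (mu, M, d, w), where M is the inverse of the quadratic form Q
    and d holds the final distances d_i.
    """
    n, p = len(points), len(points[0])
    w = [1.0 / n] * n                      # initial weights 1/n
    for _ in range(max_iter):
        mu = [sum(w[i] * points[i][k] for i in range(n)) for k in range(p)]
        dev = [[x[k] - mu[k] for k in range(p)] for x in points]
        M = [[sum(w[i] * dev[i][a] * dev[i][b] for i in range(n))
              for b in range(p)] for a in range(p)]
        d = []
        for v in dev:
            y = solve(M, v)                # y = Q v without forming Q
            d.append(sum(v[k] * y[k] for k in range(p)))
        if max(d) < p + eps:               # stopping rule d_max < p + eps
            break
        w = [w[i] * d[i] / p for i in range(n)]
    return mu, M, d, w


corners_and_center = [(1.0, 1.0), (1.0, -1.0), (-1.0, -1.0), (-1.0, 1.0), (0.0, 0.0)]
mu, M, d, w = min_covering_ellipsoid(corners_and_center)
```

For the four corners of a square plus its center, the weight on the center drops to zero after one update and the fitted ellipse is the circle through the corners. A new point x can then be tested for membership, as in Kendall's discrimination suggestion, by checking whether $(x - \mu)^T Q (x - \mu) \le p$.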

4.  SIMULATION  STUDY 

4.1  Design 

Both  the  convex  hull  and  the  minimum  covering 
ellipsoid  were  computed  under  conditions  defined  by 
three  factors: 


3.  MINIMUM  COVERING  ELLIPSES 
3.1  Uses 

The  minimum  covering  ellipsoid  (MCE)  is  the 
smallest  content  ellipsoid  covering  a  set  of  points. 
Titterington  (1975)  introduced  the  MCE  and  described 
its  relationship  to  optimal  design.  The  idea  of  peeling  or 
trimming  multivariate  data  can  be  carried  out  with  the 
MCE  in  place  of  the  convex  hull.  Titterington  (1978) 
discusses  robust  estimation  of  the  correlation  coefficient 
using  elliptical  trimming.  Green  (1981)  discusses  both 
convex  hull  and  elliptical  peeling.  Cook  and  Weisberg 
(1978)  describe  using  the  MCE  to  define  the  region  of 
applicability  (interpolation  region)  of  the  independent 
variables  in  multiple  regression.  The  MCE  can  be 
substituted  for  the  convex  hull  in  Kendall’s  (1966) 
suggestion for discrimination.


(1) Distribution - Normal (0, 1), Uniform (on the unit cube)

(2)  Points  (n)  -  20,  40,  60,  80,  100 

(3)  Dimension  (p)  -  2,  3,  4,  5,  6. 

The  uniform  variates  were  generated  in  a  cube 
of  dimension  p.  The  normal  samples  are  transformed 
versions of the uniform variates. All computations were
done on a VAX 11/750.

For  each  combination  of  these  factors  we  observed 
the  time  to  compute  the  convex  hull,  the  number  of  faces 
and  vertices  of  the  convex  hull,  the  time  to  compute  the 
MCE  and  the  number  of  points  on  the  surface  of  the 
MCE. 
Figure 2. Computation Times for Convex Hulls of
Random  Uniform  Points  on  the  Unit  Cube.  The  text 
plotted  for  each  result  is  the  dimension,  p. 


Figure  3.  Computation  Times  for  Convex  Hulls 
of  Random  Normal  Points.  The  text  plotted  for  each 
result  is  the  dimension,  p. 
Figure 2 shows the times required to compute
convex  hulls  for  random  uniform  variates  in  the  unit 
cube.  In  some  cases  only  one  replication  is  shown 
because  three  of  the  runs  ran  out  of  space.  The  runs 
for  100  points  in  6  dimensions  took  over  3  hours  of  cpu 
time.  Figure  3  shows  the  corresponding  times  for  the 
normal  data.  The  time  to  compute  hulls  of  random 
normal  data  is  less,  but  the  runs  for  100  points  in  6 
dimensions  still  take  nearly  3  hours.  In  contrast,  Figure 
4  shows  the  times  for  the  MCE  for  normal  data.  None 
of  the  times  is  more  than  2  minutes.  The  convex  hull 
is  time  consuming  to  compute  with  the  gift-wrapping 
algorithm  because  the  hulls  become  very  complex  with 
many  faces  to  find.  In  order  to  determine  if  a  new  point 
is  in  the  convex  hull,  one  needs  to  retain  information  for 
each  face.  Figure  5  shows  the  numbers  of  faces  for  the 
hulls  of  normal  data.  Even  for  40  points  in  6  dimensions, 
the  convex  hulls  have  over  1000  faces.  For  100  points 
in  6  dimensions,  the  hulls  have  over  3000  faces.  In 
dimensions  2  and  3  the  expected  number  of  faces  can 
be  found  using  results  from  Efron  (1965).  The  expected 
values  are  given  by  stars  on  Figure  5.  The  simulated 
results  are  in  good  agreement  with  the  expected  values. 

Figure  6  shows  the  results  for  the  number  of 
vertices  on  the  hull,  except  that  the  number  of  vertices 
is  converted  to  percent  of  n,  the  number  of  points.  For 
40  points  in  6  dimensions,  for  example,  over  90%  of  the 
points  were  vertex  points  of  the  convex  hull.  Even  for 
100 points in 6 dimensions, about 70% of the points
are  vertex  points.  In  contrast  Figure  7  shows  that  the 
percentage  of  points  on  the  surface  of  the  MCE  is  much 
smaller.  For  example  only  about  10%  of  the  100  points 
in  p=4  dimensions  are  on  the  surface  of  the  MCE. 


Figure  4.  Computation  Times  for  Minimum 
Covering Ellipsoids for Random Normal Points. The
text  plotted  for  each  result  is  the  dimension,  p. 
5. DISCUSSION


Figure  5.  Number  of  Faces  for  Convex  Hulls  of 
Random Normal Points. The text plotted for each result
is  the  dimension,  p.  A  *  indicates  the  expected  value 
for  2  and  3  dimensions. 
Convex hulls in higher dimensions are complex with
many  faces.  The  time  to  find  the  hull  and  space  required 
to  save  the  results  could  be  enormous.  The  minimum 
covering  ellipsoids  were  found  in  much  less  time,  and 
the  results  require  saving  only  the  center  and  quadratic 
form  of  the  ellipsoid.  In  higher  dimensions  the  convex 
hull  can  be  mostly  vertex  points.  This  has  important 
implications  for  the  potential  use  of  convex  hulls  in 
peeling  or  discrimination.  Clearly,  peeling  is  not  useful 
if  most  of  the  points  are  in  the  outside  layer.  In  a 
discrimination  context,  if  most  of  the  points  are  vertex 
points,  the  expected  probability  content  of  the  hull 
would  be  estimated  to  be  small.  Such  a  hull  would  have 
a  small  chance  of  containing  a  new  point  drawn  from 
the same distribution. More explicitly, the expected
probability content of a convex hull of n-1 points is
$1 - E(V)/n$, where E(V) is the expected number of vertices
of a hull of n points. This is related to the observation
that  a  point  is  in  the  convex  hull  of  the  other  n-1  points 
if  and  only  if  it  is  not  a  vertex  point  for  the  hull  of  the 
original  n  points.  From  the  results  here  one  sees  that  a 
very  large  number  of  points  is  going  to  be  needed  in  high 
dimensions  in  order  for  the  convex  hull  to  contain  much 
of  the  probability  space.  By  contrast  the  minimum 
covering  ellipsoid  has  many  fewer  points  on  the  surface 
and  therefore  has  a  much  higher  estimated  probability 
content. 


Figure  6.  Percent  of  Points  on  Surface  of  Convex 
Hulls  of  Random  Normal  Points. 


Figure  7.  Percent  of  Points  on  the  Surface 
of  Minimum  Covering  Ellipsoids  of  Random  Normal 
Points. 
A JOHNSON CURVE APPROACH TO
WARMING  UP  TIME  SERIES  SIMULATIONS 

David  A.  Burn,  IMSL,  Inc. 


The simulation of a time series from a specified autoregressive
moving average (ARMA) model requires knowledge of the initial
values of the time series and/or innovations process. Given
independent, identically distributed zero mean innovations, the
start values of the time series are often derived from a moving
average approximation of the series. This approximation introduces
bias of a transient nature into the system, and requires
the simulation to be run for a period of time in order to diminish
the influence of the initial values of the series. To avoid the
necessity of warming up the simulation, we consider a Johnson
curve approximation of the distribution of the initial values of
the time series.

KEY  WORDS:  Autoregressive,  Moving  average,  Skewness, 
Kurtosis,  Transient. 

1.  INTRODUCTION 

1.1  General  ARMA  Model 

Define  the  general  form  of  the  ARMA(p,  q)  model  by 

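$\phi_p(B) W_t = \theta_0 + \theta_q(B) A_t$  (1)

where

$\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p, \quad p \ge 0$  (2)

$\theta_q(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q, \quad q \ge 0$  (3)

and B is the backward shift operator defined by
$B^k W_t = W_{t-k}$, for all k.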

This  definition  includes  the  following  assumptions: 

1. The innovations $A_t$ are independent and identically
distributed random variables with mean zero and variance
$\sigma_A^2$.

2. The autoregressive operator $\phi_p(B)$ is stationary.
Equivalently, the roots of the equation $\phi_p(B) = 0$ lie outside
the unit circle.

3. The moving average operator $\theta_q(B)$ is invertible.
Equivalently, the roots of the equation $\theta_q(B) = 0$ lie outside
the unit circle.

The model is general in that the constant term $\theta_0$ is included
to allow for a nonzero series mean $\mu$. Refer to Box and Jenkins
(1976, pp. 91-93) for further discussion.

1.2  Equivalent  Representations 

The random shock form of the general ARMA(p, q) is given by

$W_t = \mu + \psi(B) A_t$  (4)

where

$\psi(B) = \phi_p^{-1}(B) \theta_q(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots$.  (5)

The $\psi$ weights of the infinite order moving average may be
determined by equating coefficients of B in

$\phi_p(B) \psi(B) = \theta_q(B)$

(see Box and Jenkins, 1976, pp. 93-96). The random shock
model is particularly useful since the moments of the time series
$W_t$ can be derived for a specified distribution of the innovations
$A_t$. Note that the general ARMA(p, q) model and its random
shock form may be equivalently expressed as

$\phi_p(B) \tilde{W}_t = \theta_q(B) A_t$  (6)

and

$\tilde{W}_t = \psi(B) A_t$  (7)

respectively, where $\tilde{W}_t = W_t - \mu$ corresponds to a time series
with zero mean.
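
Equating coefficients of $B^j$ in $\phi_p(B)\psi(B) = \theta_q(B)$ gives the recursion $\psi_0 = 1$ and $\psi_j = \sum_{i=1}^{\min(j,p)} \phi_i \psi_{j-i} - \theta_j$, with $\theta_j = 0$ for $j > q$. A minimal Python sketch of that recursion follows; the function name `psi_weights` and its argument layout are assumptions made for illustration.

```python
def psi_weights(phi, theta, count):
    """Return psi_0, ..., psi_{count-1} for the ARMA model with
    AR coefficients phi = [phi_1, ..., phi_p] and
    MA coefficients theta = [theta_1, ..., theta_q]."""
    psi = [1.0]
    for j in range(1, count):
        ar = sum(phi[i - 1] * psi[j - i] for i in range(1, min(j, len(phi)) + 1))
        ma = theta[j - 1] if j <= len(theta) else 0.0
        psi.append(ar - ma)
    return psi


ar1 = psi_weights([0.5], [], 4)        # AR(1): psi_j = 0.5 ** j
arma11 = psi_weights([0.5], [0.3], 3)  # ARMA(1,1): psi_1 = 0.5 - 0.3
```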

1.3  The  Simulation  Problem 

Suppose  we  wish  to  simulate  a  time  series  Wt  of  length  n 
according  to  a  specified  ARMA(p,  q)  model.  The  induction 
period  is  the  length  of  time  required  to  minimize  the  transient 
bias  induced  by  starting  the  run  (Anderson,  1975).  Let  the 
total  number  of  generated  observations  of  the  time  series  be 
m  +  n  where  the  simulation  is  warmed  up  with  m  discardable 
observations  of  the  time  series.  Clearly,  we  desire  as  short  an 
induction  period  m  as  possible. 

The method of generating the initial values of the time series
and innovations process required to start the simulation
run directly affects the induction period. Since the innovations
are assumed to be independent, the moving average part of the
model may be easily initialized with q + 1 pseudo-random numbers
from the specified distribution of $A_t$. However, the $W_t$ are
not independent, so that production of the p series start values
is a major problem of the simulation experiment. We identify
two general approaches to this problem:

Approach  A  Generate  the  initial  series  values  from  an 
approximation  of  the  model  of  the  time  series. 

Approach  B  Generate  the  initial  series  values  from  an 
approximation  of  the  joint  distribution  of  the  time  series. 

Other  methods  have  been  proposed;  for  example,  see  Piccolo 
and  Wilson  (1984). 

A  prototypical  algorithm  to  simulate  a  time  series  consists 
of  the  following  steps: 

ALGORITHM  ARMA 

1. Generate $W_{1-p}, W_{2-p}, \ldots, W_0$.

2. Generate $A_{1-q}, A_{2-q}, \ldots, A_0$.

3. Set t = 1.

4. Generate $A_t$.

5. Compute $W_t$ using ARMA(p, q) model.

6. Set t = t + 1.

7.  Repeat  Step  4  through  Step  6  until  t  =  m  +  n. 


where A represents the innovations $A_1, \ldots, A_{m+n}$. Then

$\tilde{W}_t = f_t(m, n, \phi) W_0 + h_t(m, n, \phi, A)$  (13)


The desired time series is given by $W_t$ for $t = m+1, \ldots, m+n$.
Some  particular  applications  of  Approach  A  and  Approach  B 
are  next  examined. 
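
The steps of ALGORITHM ARMA can be sketched as a short Python loop. This is an illustration under assumptions, not the author's code: the name `simulate_arma`, the argument order, and the use of a caller-supplied `draw` function for the innovations are all hypothetical. The start values are taken as given, and the model is written as $W_t = \theta_0 + \sum_i \phi_i W_{t-i} + A_t - \sum_j \theta_j A_{t-j}$, following (1) through (3).

```python
import random


def simulate_arma(phi, theta, theta0, w_start, a_start, m, n, draw):
    """Generate m + n observations and return the last n (n >= 1).

    w_start holds W_{1-p}, ..., W_0 and a_start holds A_{1-q}, ..., A_0.
    draw() returns one innovation from the chosen distribution.
    """
    p, q = len(phi), len(theta)
    w = list(w_start)
    a = list(a_start)
    for _ in range(m + n):
        a.append(draw())                                  # Step 4
        ar = sum(phi[i] * w[-1 - i] for i in range(p))    # uses W_{t-1}, ..., W_{t-p}
        ma = sum(theta[j] * a[-2 - j] for j in range(q))  # uses A_{t-1}, ..., A_{t-q}
        w.append(theta0 + ar + a[-1] - ma)                # Step 5
    return w[-n:]                                         # discard the warm-up


# Deterministic checks with all innovations fixed.
ar_path = simulate_arma([0.5], [], 0.0, [8.0], [], 1, 2, lambda: 0.0)
ma_path = simulate_arma([], [0.5], 0.0, [], [2.0], 0, 2, lambda: 1.0)

# A random run with standard normal innovations and m = 50 warm-up values.
rng = random.Random(1986)
series = simulate_arma([0.5], [0.3], 0.0, [0.0], [0.0], 50, 100,
                       lambda: rng.gauss(0.0, 1.0))
```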


where

$f_t(m, n, \phi) = \phi^{m+1} \left[ \phi^{t-1} - \frac{1 - \phi^n}{n(1 - \phi)} \right]$  (14)


2.  WARMING  UP 

2.1  A  Finite  Approximation 

One version of Approach A utilizes a finite order moving average
approximation of the general ARMA(p, q) model,

$W_t = \sum_{j=0}^{m} \psi_j A_{t-j}$  (8)

where m is chosen sufficiently large. (We assume $\theta_0 = 0$ in this
section.) The algorithm consists of the following steps:

ALGORITHM  A 


1.  Determine  m. 

2. Generate m + p pseudo-random numbers $A_{1-(m+p)}$,
$A_{2-(m+p)}, \ldots, A_0$ from the innovations distribution.

3. Construct p series start values $W_{1-p}, W_{2-p}, \ldots, W_0$ using
the MA(m) model (8) and the innovations from Step 2.
Discard the innovations.

Often  m  is  chosen  arbitrarily  and  the  simulation  may  be  warmed 
up  longer  than  necessary. 
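
Step 3 of ALGORITHM A amounts to applying the truncated moving average (8) at the p time points $1-p, \ldots, 0$. A minimal Python sketch is given below; the function name `ma_start_values` and the convention that the $\psi$ weights are supplied as a list of length m + 1 are assumptions made for illustration.

```python
def ma_start_values(psi, p, innovations):
    """Return W_{1-p}, ..., W_0 from the MA(m) approximation (8).

    psi         -- [psi_0, ..., psi_m]
    innovations -- [A_{1-(m+p)}, ..., A_0], exactly m + p values
    """
    m = len(psi) - 1
    if len(innovations) != m + p:
        raise ValueError("need exactly m + p innovations")
    starts = []
    for s in range(p):
        newest = m + s  # list index of A_t for t = 1 - p + s
        starts.append(sum(psi[j] * innovations[newest - j] for j in range(m + 1)))
    return starts


# m = 2, p = 2, innovations A_{-3}, A_{-2}, A_{-1}, A_0
starts = ma_start_values([1.0, 0.5, 0.25], 2, [1.0, 2.0, 3.0, 4.0])
```

Here $W_{-1} = 3 + 0.5(2) + 0.25(1) = 4.25$ and $W_0 = 4 + 0.5(3) + 0.25(2) = 6$.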

2.2  Determination  of  Induction  Period 

A “precise” method of determining the optimal length of the
induction period was proposed by Anderson (1979). Consider
the AR(1) model

$W_t = \phi W_{t-1} + A_t, \quad |\phi| < 1$  (9)

and its associated MA(m) approximation

$W_t = \sum_{j=0}^{m} \phi^j A_{t-j}$.  (10)

Anderson (1975) states that m should be chosen such that the
variance of the transient bias

$\mathrm{Var}\left( \sum_{j=m+1}^{\infty} \phi^j A_{t-j} \right)$  (11)

is sufficiently small. However, this approach may be deceptive
and lead to an excessively long induction period (Anderson,
1979).

Instead, Anderson considers $\tilde{W}_t = W_t - \bar{W}$, where $W_t$ is the
AR(1) series generated by ALGORITHM ARMA and


and the function $h_t(m, n, \phi, A)$ is not affected by the behavior
of the series start value $W_0$. Hence, the dependence of the
simulated series upon the initial series value is minimized by
selecting m such that

$|f_t(m, n, \phi)| < \epsilon, \quad \epsilon > 0$.  (15)


Since $f_t(m, n, \phi)$ is maximized at t = 1, the induction period
m is determined by the inequality


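$m > \frac{\ln\left\{ \epsilon \Big/ \left| 1 - \frac{1 - \phi^n}{n(1 - \phi)} \right| \right\}}{\ln |\phi|} - 1, \quad \epsilon > 0$.  (16)


For a given $\epsilon > 0$, $m = m(\phi, n)$ is a function of the parameter
$\phi$ and the length of the simulated series n.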
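
As a rough numerical illustration, the sketch below searches for the smallest m that satisfies the criterion $|f_t(m, n, \phi)| < \epsilon$ for all t in an AR(1) model. It assumes the coefficient takes the form $f_t(m, n, \phi) = \phi^{m+1}[\phi^{t-1} - (1 - \phi^n)/(n(1 - \phi))]$, which is a reconstruction from the damaged source rather than a verified formula, and it checks every t instead of relying on the closed-form bound. The names `f_coef` and `induction_period` are hypothetical.

```python
def f_coef(t, m, n, phi):
    """Coefficient of W_0 in the mean-corrected AR(1) series (assumed form)."""
    mean_term = (1.0 - phi ** n) / (n * (1.0 - phi))
    return phi ** (m + 1) * (phi ** (t - 1) - mean_term)


def induction_period(phi, n, eps, m_max=100000):
    """Smallest m with |f_t(m, n, phi)| < eps for every t = 1, ..., n."""
    for m in range(m_max + 1):
        if max(abs(f_coef(t, m, n, phi)) for t in range(1, n + 1)) < eps:
            return m
    raise ValueError("no m <= m_max satisfies the criterion")


m_needed = induction_period(0.5, 10, 0.01)
```

With $\phi = 0.5$, n = 10 and $\epsilon = 0.01$ the bracketed term at t = 1 is about 0.80, so $0.5^{m+1}$ must fall below about 0.0125, which first happens at m = 6.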

Anderson  (1979)  states  that  this  method  extends  to  the 
general  ARMA(p,  q)  model.  Similar  to  (13),  we  have 


W,  =  E/fuKn,*. . 

3=1 

+  XT  •  •  >*p.®  . . 0q>A)  (17) 

k= 0 

where A represents the initial moving average innovations $A_{1-q}$,
$A_{2-q}, \ldots, A_0$ in addition to the innovations $A_1, \ldots, A_{m+n}$. The
induction period m is dependent upon the length of the simulated
series n through the functions $f_{tj}(m, n, \phi_1, \ldots, \phi_p)$, and
is minimized when their sum is negligible. However, the form
of this dependence is quite complicated for models with p > 1.

The  need  to  warm  up  the  simulation,  and  hence  the  need 
to  select  an  optimal  value  of  the  induction  period,  are  artifacts 
of  the  moving  average  approximation  of  the  series  start  values. 
To bypass both of these problems, we derive a method of constructing
the initial values of the series directly from the joint
distribution  of  the  time  series. 


3.  THE  JOHNSON  SYSTEM 

3.1  Methods  of  Translation 

To provide a mathematical representation of a wide variety of
statistical distributions, Johnson (1949) proposed a family of
frequency curves generated by methods of translation. Let x
denote a random variable whose distribution we wish to model,
and let z represent a standard normal random variable. Consider
the infinite class of transformations


$\bar{W} = \frac{1}{n} \sum_{t=m+1}^{m+n} W_t = \frac{\phi^{m+1}(1 - \phi^n)}{n(1 - \phi)} W_0 + g(m, n, \phi, A)$  (12)


where f is a monotone function of x and is dependent only
upon fixed parameters. Define the standard form of this
transformation to be

$z = \gamma + \delta f(y)$

where

$y = (x - \xi)/\lambda$.

Then the density function of y is given by

$p(y) = \delta f'(y) p(z)$  (18)

$\quad = \frac{\delta}{\sqrt{2\pi}} f'(y) \exp\left\{ -\frac{1}{2} \left[ \gamma + \delta f(y) \right]^2 \right\}$  (19)

and has the same shape as the density function of x. The
parameters $\xi$ and $\lambda$ are location and scale factors, respectively,
and the parameters $\delta$ and $\gamma$ affect the skewness and kurtosis,
respectively.


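3.2 Systems of Interest

The special systems described by Johnson are

1. Lognormal, $S_L$

$z = \gamma + \delta \ln\left( \frac{x - \xi}{\lambda} \right), \quad \xi < x < \infty$.

2. Bounded, $S_B$

$z = \gamma + \delta \ln\left( \frac{x - \xi}{\xi + \lambda - x} \right), \quad \xi < x < \xi + \lambda$.

3. Unbounded, $S_U$

$z = \gamma + \delta \sinh^{-1}\left( \frac{x - \xi}{\lambda} \right), \quad -\infty < x < \infty$.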


These three systems encompass most of the distributions
common to statistical analysis. For information concerning
alternative systems of distributions, see Johnson (1949), Elderton
and Johnson (1965), Kendall and Stuart (1969), and Ord (1972).
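
Inverting the three standard transformations turns a standard normal variate z into a Johnson variate x, which is the role the later sections assign to algorithm AS 100. The Python sketch below is an illustration of those inverses only, assuming the usual textbook forms of the $S_L$, $S_B$ and $S_U$ transformations; it is not a port of AS 100, and the function names and the string codes "SL", "SB", "SU" are hypothetical.

```python
import math


def johnson_variate(z, system, gamma, delta, xi, lam):
    """Map a standard normal value z to a Johnson variate x."""
    u = (z - gamma) / delta
    if system == "SL":      # lognormal: x > xi
        y = math.exp(u)
    elif system == "SB":    # bounded: xi < x < xi + lam
        y = 1.0 / (1.0 + math.exp(-u))
    elif system == "SU":    # unbounded
        y = math.sinh(u)
    else:
        raise ValueError("system must be 'SL', 'SB' or 'SU'")
    return xi + lam * y


def johnson_su_normal_score(x, gamma, delta, xi, lam):
    """Forward S_U transformation z = gamma + delta * asinh((x - xi) / lam)."""
    return gamma + delta * math.asinh((x - xi) / lam)


x_su = johnson_variate(1.3, "SU", 0.5, 2.0, 10.0, 3.0)
z_back = johnson_su_normal_score(x_su, 0.5, 2.0, 10.0, 3.0)
```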


3.3  Fitting  by  Moments 

To fit a Johnson curve, we derive the mean, standard deviation,
coefficient of skewness, and coefficient of kurtosis of the
specified distribution. Next, the appropriate system and
parameters of the Johnson curve representation of the specified
distribution are determined using algorithm AS 99 (Hill, Hill,
and Holder, 1985). Pseudo-random observations from the fitted
Johnson curve are obtained by transforming a pseudo-random
standard normal variate to a Johnson variate using algorithm
AS 100 (Hill, 1985).

For the simulation experiment, we require only the first four
moments of the theoretical distribution to determine a
corresponding representation within the Johnson system. Since the
moments of the distribution to be modelled are calculated
theoretically, no sampling error is introduced. Also, we are mainly
interested in first and second order properties of the simulated
time series. Hence, we view the method of moments approach
to fitting a Johnson curve to be acceptable. A number of
alternative methods of estimating the parameters of a Johnson
curve are discussed by Johnson (1949), Elderton and Johnson
(1965), and Ord (1972).


3.4  Series  Moments 

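The relationship between the central moments of the time series
and the central moments of the innovations process is given by

$\mu_1(W) = \mu_1(A) \sum_{j=0}^{\infty} \psi_j$

$\mu_2(W) = \mu_2(A) \sum_{j=0}^{\infty} \psi_j^2$

$\mu_3(W) = \mu_3(A) \sum_{j=0}^{\infty} \psi_j^3$

$\mu_4(W) = \{\mu_4(A) - 3[\mu_2(A)]^2\} \sum_{j=0}^{\infty} \psi_j^4 + 3[\mu_2(A)]^2 \left( \sum_{j=0}^{\infty} \psi_j^2 \right)^2$

where $\mu_i(W)$ and $\mu_i(A)$ denote the i th central moment of $W_t$
and $A_t$ respectively, for i = 1, 2, 3, 4. The mean, standard
deviation, coefficient of skewness, and coefficient of kurtosis of
the distribution of $W_t$ are defined by

$\mu = \mu_1(W)$

$\sigma = [\mu_2(W)]^{1/2}$

$\sqrt{\beta_1} = \mu_3(W) / [\mu_2(W)]^{3/2}$

$\beta_2 = \mu_4(W) / [\mu_2(W)]^2$.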


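Hence, for independent and identically distributed $(0, \sigma_A^2)$
innovations,

$\mu = 0$  (20)

$\sigma = \sigma_A \left( \sum_{j=0}^{\infty} \psi_j^2 \right)^{1/2}$  (21)

$\sqrt{\beta_1} = \sqrt{\beta_1(A)} \, \frac{\sum_{j=0}^{\infty} \psi_j^3}{\left( \sum_{j=0}^{\infty} \psi_j^2 \right)^{3/2}}$  (22)

$\beta_2 = [\beta_2(A) - 3] \, \frac{\sum_{j=0}^{\infty} \psi_j^4}{\left( \sum_{j=0}^{\infty} \psi_j^2 \right)^2} + 3$.  (23)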

The stationarity and invertibility assumptions imply that the
infinite sums of the $\psi_j$ are absolutely convergent. Davies,
Spedding, and Watson (1980) state that for low order ARMA
models, at most 30 of the $\psi_j$ are required to compute the above
moments.
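
The standard deviation, skewness and kurtosis of the series follow directly from a truncated list of $\psi$ weights. The Python sketch below evaluates them for zero-mean innovations; the function name `series_shape` is hypothetical, and the innovation values used in the example (skewness 2, kurtosis 9, as for a centered exponential distribution) are chosen only for illustration.

```python
def series_shape(psi, sigma_a, skew_a, kurt_a):
    """Return (sigma, sqrt(beta_1), beta_2) of W_t from truncated psi weights."""
    s2 = sum(c ** 2 for c in psi)
    s3 = sum(c ** 3 for c in psi)
    s4 = sum(c ** 4 for c in psi)
    sigma = sigma_a * s2 ** 0.5
    skew = skew_a * s3 / s2 ** 1.5
    kurt = (kurt_a - 3.0) * s4 / s2 ** 2 + 3.0
    return sigma, skew, kurt


phi = 0.5
psi = [phi ** j for j in range(60)]  # AR(1) weights, truncated at 60 terms
sigma, skew, kurt = series_shape(psi, 1.0, 2.0, 9.0)
```

For the AR(1) case the sums are geometric, so the output can be compared with the closed forms $\sigma_A (1 - \phi^2)^{-1/2}$, $\sqrt{\beta_1(A)} (1 - \phi^2)^{3/2} / (1 - \phi^3)$ and $[\beta_2(A) - 3](1 - \phi^2)/(1 + \phi^2) + 3$. Both skewness and excess kurtosis shrink toward the normal values as the series averages more innovations.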


4. GENERATION OF START VALUES

4.1  A  Simple  Example 


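We introduce the Johnson curve approach to generating the
initial values of a time series in the context of the AR(1) model

$W_t = \phi W_{t-1} + A_t, \quad |\phi| < 1$  (24)

and its equivalent random shock form

$W_t = \sum_{j=0}^{\infty} \phi^j A_{t-j}, \quad |\phi| < 1$.  (25)

Setting $\psi_j = \phi^j$ in (21) through (23) gives

$\sigma = \sigma_A \left( \frac{1}{1 - \phi^2} \right)^{1/2}$

$\sqrt{\beta_1} = \sqrt{\beta_1(A)} \, \frac{(1 - \phi^2)^{3/2}}{1 - \phi^3}$

$\beta_2 = [\beta_2(A) - 3] \, \frac{1 - \phi^2}{1 + \phi^2} + 3$.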

The  start  value  of  the  AR(1)  model  may  then  be  obtained  by 
fitting  a  Johnson  curve  and  generating  a  Johnson  variate  as 
previously  described. 

4.2  A  Joint  Distribution 

Let $P(W_{i+1}, W_{i+2}, \ldots, W_{i+p})$ represent the joint probability of
p consecutive elements of the time series $W_t$. Let $w_t$ denote
a particular realisation of the element $W_t$. The definition of
conditional probability implies

$P(W_{i+1}, W_{i+2}, \ldots, W_{i+p}) = P(W_{i+1}) P(W_{i+2} \mid w_{i+1}) P(W_{i+3} \mid w_{i+1}, w_{i+2}) \cdots P(W_{i+p} \mid w_{i+1}, w_{i+2}, \ldots, w_{i+p-1})$.

Each term on the right hand side corresponds to a univariate
distribution which may be approximated by a member of the
Johnson system. The observed values $w_{i+j}$ for j = 1, ..., p
may therefore be obtained from the joint distribution of $W_t$ by
constructing Johnson curve approximations to successive
distributions, each conditional on the previously generated observed
values $w_t$.

4.3  The  General  Algorithm 

We  now  develop  a  procedure  to  generate  the  series  start  values 
for  the  general  ARMA(p,  q)  model.  Define 


= 

'  l«o. 


-i>  *=1.2 . P-1 


i  -  Ej=i  h 

Let $1 \le k \le p$ and consider the general ARMA(k, q) model

$\phi_k(B) W_t = \theta_{0k} + \theta_q(B) A_t$  (28)

and its equivalent random shock form

$W_t = \mu_k + \psi_k(B) A_t$  (29)

where $\phi_k(B)$, $\theta_q(B)$, and $\psi_k(B)$ are defined as in (2), (3), and
(5) respectively. The algorithm to generate the initial values of
the time series for the general ARMA(p, q) model via Approach
B consists of the following steps:

Algorithm  B 

1. Determine $\sqrt{\beta_1(A)}$ and $\beta_2(A)$ for a specified innovations
distribution with mean 0 and variance $\sigma_A^2$.

2. Set k = p.

3. Compute $\mu_k$ using (27).

4. Compute $\sigma$, $\sqrt{\beta_1}$, and $\beta_2$ using (21) through (23) and the
result of Step 1.

5. Determine the type of Johnson curve and the parameters
$\xi$, $\lambda$, $\delta$, and $\gamma$ using algorithm AS 99 given the results of
Step 4.

6. Generate a pseudo-random number from the standard
normal distribution.

7. Apply the inverse transformation of algorithm AS 100
to compute a pseudo-random Johnson variate using the
Johnson curve determined in Step 5 and the result of Step
6.

8. Compute $w_{1-k}$ by adding $\mu_k$ from Step 3 to the
pseudo-random Johnson variate from Step 7.

9. Set k = k - 1.

10. Repeat Step 3 through Step 9 until k = 0.

The values $w_{1-p}, \ldots, w_0$ constitute an observation from
the joint distribution of p consecutive elements of the time
series $W_t$.

5. DISCUSSION

We have considered two approaches to warming up time
series simulations. Approach A, as implemented through
ALGORITHM A, approximates the general ARMA(p, q) model
by a finite order moving average in order to determine the p
start values of the series. The transient bias introduced by
beginning the simulation run with approximate start values
requires an induction period of m observations. To avoid
unnecessary warming up of the time series, an optimal value of m
may be determined using the method of Anderson (1979). This
value is dependent upon the length of the time series to be
simulated as well as the autoregressive parameters of the model.
Although this method extends to the general ARMA(p, q) model,
the complexity of the expression for the minimal induction
period may prohibit its use in practice.

Since the source of the difficulty with Approach A lies in
the approximation used to obtain the start values of the series,
we proposed a method to directly generate these observations
from the joint distribution of the time series, called Approach
B. This approach may be implemented using the strategy of
ALGORITHM B. For a specified innovations distribution, the
mean, standard deviation, skewness, and kurtosis of the time
series are computed for general ARMA(k, q) models of decreasing
order in k. At each value of k, the distribution of the time
series is approximated by a Johnson curve and a start value
is generated. Using the definition of conditional probability,
the start values are generated in succession, and together
constitute an observation from the joint distribution of the time
series. In practice, ALGORITHM B may be applied with a
moderate amount of warming up to compensate for fitting the
distribution based on the first four moments.

The heuristic descriptions of Approach A and Approach
B involve approximations of the time series model and of the
joint distribution of the time series, respectively. Comparison
of these approaches may be performed with respect to their
implementation in ALGORITHM A and ALGORITHM B. In
particular, ALGORITHM B

• requires no explicit warming up period;

• does not depend on the length of the simulated series;

• readily extends to the general ARMA(p, q) model.

Further research into the properties of both approaches and
algorithms is in progress.
Censored Discrete Data and Generalised Linear Models
Douglas  B.  Clarkson,  IMSL  Inc. 


Abstract 

Generalised linear models in discrete data encompass,
among other models, logistic regression models, probit
models, and Poisson regression models. This paper
discusses an algorithm for computing parameter estimates
in such models when interval (and other) censoring is
present in the data. Also discussed are some solutions
to problems encountered in the algorithm, along with the
statistical implications of some forms of model degeneracy.
Finally, censored data analogues of some common
non-censored data graphical techniques and statistics are
given.

1.0  Introduction 

This paper discusses some experiences gained when
implementing subroutine CTGLM (CaTegorical Generalized
Linear Models) for inclusion in the IMSL libraries.
Although concern will be with linear models in the sense
of Nelder and Wedderburn (1972) or McCullagh and Nelder
(1983), their terminology will not be used. The main
advantage that CTGLM seems to offer over similar
subroutines is the ability to handle censored (right, left, or
interval) data directly. This ability causes some problems
in the usual algorithms and in the usual analysis.
Discussion will center around how these problems can be
resolved.

2.0  The  Models 

CTGLM handles four discrete distributions and a total
of six models. User specified models (without censoring)
are handled by another routine. Let $x_i$ denote a row
vector of covariates, $\beta$ denote a column vector of
parameters, $n_i$ denote the binomial sample size, $r_i$ denote the
number of successes in the negative binomial, $\lambda_i$ denote
the Poisson parameter, $\theta_i$ denote the probability of
success in a single Bernoulli trial, $y_i$ denote the realization
of the random variable, and let $\Phi$ denote the cumulative
standard normal probability distribution function. Then
the possible models are given as:

1. Binomial, $f(y_i \mid n_i, \theta_i, x_i)$, with three models for $\theta_i$:

(a) Logistic: $\theta_i = \exp(x_i\beta) / (1 + \exp(x_i\beta))$.

(b) Probit: $\theta_i = \Phi(\exp(x_i\beta))$.

(c) Log-log: $\theta_i = \exp(-\exp(x_i\beta))$.

2. Poisson: $f(y_i \mid \lambda_i, x_i)$, $\lambda_i = \exp(x_i\beta)$.

3. Negative Binomial: $f(y_i \mid \theta_i, r_i, x_i)$,

4. Logarithmic: $f(y_i \mid \theta_i, x_i)$, where $\theta_i$ =


In all models, left censoring is the same as interval
censoring with a left endpoint of zero. In the binomial
models, right censoring is also the same as interval
censoring with a right endpoint of $n_i$. Note that the
covariates $x_i$ may be (and usually are) vector valued.

3.0  Example 

The following table helps to illustrate the logistic model
for interval censored data. An interval for the number
of deaths at a given dose level and sample size is given,
along with the maximum likelihood estimates of the
binomial probabilities, $\hat{\theta}_i$, and the estimated probability of
the observation (censoring interval), $\hat{f}_i$. Throughout this
paper it is assumed that the censoring mechanism operates
independently of the binomial probability and of the
outcome. Other censoring mechanisms may be possible. See,
e.g., Kalbfleisch and Prentice (1980).

Table 1
An Example

Dose   Number    Sample   Censoring   $\hat{\theta}_i$   $\hat{f}_i$
       Deaths    Size     Type
1      0-3       100      Left        .0124              0.964
2      7-15      100      Interval    .1021              0.849
3      40-60     100      Interval    .5068              0.963
4      80-100    100      Right       .9028              0.999

When a binomial model is fit to the data with a simple linear logistic model on $x_i$ = dose, one obtains the maximum likelihood estimate for the intercept as $\hat\beta_0 = -6.576$ with slope estimate $\hat\beta_1 = 2.201$. The usual asymptotic statistics may also be computed. The estimates $\hat\theta_i$ of the 'cell' probabilities are obtained from the maximum likelihood estimates.
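The entries of Table 1 can be checked from the reported estimates. The following is a minimal Python sketch, assuming the standard logistic link and that the probability of a censored observation is the binomial probability summed over its interval; the function names are illustrative and are not part of CTGLM.

```python
import math

def logistic(eta):
    """Logistic link: binomial probability for linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def interval_probability(n, theta, lo, hi):
    """P(lo <= Y <= hi) for Y ~ Binomial(n, theta)."""
    return sum(math.comb(n, y) * theta**y * (1.0 - theta)**(n - y)
               for y in range(lo, hi + 1))

# Reported maximum likelihood estimates (intercept, slope).
b0, b1 = -6.576, 2.201

# (dose, interval lower end, interval upper end, sample size) from Table 1.
table = [(1, 0, 3, 100), (2, 7, 15, 100), (3, 40, 60, 100), (4, 80, 100, 100)]

results = []
for dose, lo, hi, n in table:
    theta = logistic(b0 + b1 * dose)
    f = interval_probability(n, theta, lo, hi)
    results.append((dose, theta, f))
    print(f"dose {dose}: theta = {theta:.4f}, f = {f:.3f}")
```

Because the estimates are reported to four significant digits, the printed values should agree with the table only up to rounding.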

Data of this kind may arise, for instance, in an experiment on an insecticide. Insects may be censored because they die for reasons unrelated to the insecticide, and before the effect of the insecticide can be assessed.

4.0  The  Algorithm 

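Let $\eta_i = x_i\beta$ and note that the derivatives of $f_i$ in the following are with respect to $\eta_i$, and that $f_i$ and its derivatives are evaluated at $\eta_i$. When an observation is censored, the function $f_i$ is the sum over the censoring interval of the probability distribution. Otherwise, $f_i$ is the probability of the single observed outcome. The log-likelihood and derivatives are computed as follows:

$$l = \sum_{i=1}^{N} l_i(\beta) = \sum_{i=1}^{N} \log(f_i)$$

$$\frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{N} \frac{f_i'}{f_i}\, x_{ij}$$

$$\frac{\partial^2 l}{\partial \beta_j\, \partial \beta_k} = \sum_{i=1}^{N} \left[ \frac{f_i''}{f_i} - \left( \frac{f_i'}{f_i} \right)^2 \right] x_{ij} x_{ik}$$

where $N$ is the number of observations (rows in $X$).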

Newton-Raphson iteration can be implemented via weighted least squares with

1. weights $w_{1i}$,

2. independent variable $x_i$, and

3. dependent variable

The Newton-Raphson step is obtained as the vector of estimated regression parameters. Alternatively, scoring can be implemented via weighted least squares with

1. weights $w_{2i}$,

2. independent variable $x_i$, and

3. dependent variable

CTGLM  always  begins  with  scoring.  When  the  rel¬ 
ative  change  in  the  likelihood  from  one  iteration  to  the 
next  is  small  enough,  the  algorithm  switches  to  Newton- 
Raphson  iteration.  Step  halving  is  used  for  the  line  search 
whenever  the  likelihood  does  not  increase  with  the  initial 
step.  The  weighted  least  squares  estimates  are  computed 
via  Givens  rotations.  McCullagh  and  Nelder  (1983),  and 
Stirling  (1984),  among  others,  discuss  the  same  or  similar 
algorithms.  See  also  Green  (1984). 
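The iteration can be sketched for the interval censored logistic example of section 3.0. The Python below is only an illustration under stated assumptions: it uses plain Newton-Raphson with step halving from a zero start, solves the 2-by-2 system directly rather than by Givens rotations, and omits the initial scoring phase that CTGLM uses. All function names are hypothetical.

```python
import math

def logistic(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def f_and_derivs(eta, n, lo, hi):
    """f = P(lo <= Y <= hi) and its first two derivatives in eta (logit link)."""
    theta = logistic(eta)
    f = d1 = d2 = 0.0
    for y in range(lo, hi + 1):
        p = math.comb(n, y) * theta**y * (1.0 - theta)**(n - y)
        s = y - n * theta                      # d log p / d eta
        f += p
        d1 += p * s
        d2 += p * (s * s - n * theta * (1.0 - theta))
    return f, d1, d2

def loglik_grad_hess(beta, data):
    """Log-likelihood, gradient and Hessian for intercept + slope."""
    ll, g, h = 0.0, [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]]
    for x, n, lo, hi in data:
        row = (1.0, x)
        f, d1, d2 = f_and_derivs(beta[0] + beta[1] * x, n, lo, hi)
        if f <= 0.0:
            return -math.inf, g, h
        ll += math.log(f)
        u = d1 / f                             # f'/f
        w = d2 / f - u * u                     # f''/f - (f'/f)^2
        for j in range(2):
            g[j] += u * row[j]
            for k in range(2):
                h[j][k] += w * row[j] * row[k]
    return ll, g, h

def fit(data, beta=(0.0, 0.0), tol=1e-10, max_iter=100):
    beta = list(beta)
    ll, g, h = loglik_grad_hess(beta, data)
    for _ in range(max_iter):
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        # Newton-Raphson step: solve H * step = -g.
        step = [-(h[1][1] * g[0] - h[0][1] * g[1]) / det,
                -(h[0][0] * g[1] - h[1][0] * g[0]) / det]
        scale = 1.0
        for _ in range(30):                    # step halving line search
            trial = [beta[0] + scale * step[0], beta[1] + scale * step[1]]
            ll_new, g_new, h_new = loglik_grad_hess(trial, data)
            if ll_new >= ll:
                break
            scale *= 0.5
        beta, ll, g, h = trial, ll_new, g_new, h_new
        if max(abs(scale * s) for s in step) < tol:
            break
    return beta, ll, g

# (dose, sample size, interval lower end, interval upper end) from Table 1.
data = [(1.0, 100, 0, 3), (2.0, 100, 7, 15), (3.0, 100, 40, 60), (4.0, 100, 80, 100)]
beta_hat, ll_hat, grad = fit(data)
print(beta_hat, ll_hat)
```

For these data the iteration should settle near the intercept and slope reported in section 3.0.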

5.0  A  Convergence  Problem 

A problem with the algorithm which is more common in censored data is that one or more of the estimated parameters $\hat\eta_i = x_i\hat\beta$ may be infinite. As an example, consider the logistic model for a single observation. If the observation is such that $y_i = n_i$ (i.e., if the number of successes equals the number of trials), or if the observation is censored and the censoring interval contains $n_i$, then the maximum likelihood estimate of the binomial probability is $\hat\theta_i = 1$. In the logistic model this corresponds to $\hat\eta_i = x_i\hat\beta = \infty$, obviously an extreme situation. Alternatively, if the observation (or the censoring interval) contains 0, then $\hat\eta_i = -\infty$ is obtained. These types of extremes are more common in censored data, but they can occur in uncensored data.

With more than one observation, this same type of extreme will occur if the $i$-th observation is $n_i$ (or 0) and, additionally, the $i$-th row of the design matrix $X$ ($x_i$) is linearly independent of the remaining rows in $X$. More generally, if all rows in $X$ corresponding to a group of right (or left) censored observations are linearly independent of the remaining rows in $X$, then the $\hat\eta_i$ corresponding to these rows will be $\infty$ ($-\infty$).

To see how easily such extremes can occur in practice, consider the example in section 3.0 with a one-way ANOVA model replacing the simple linear regression model for the parameter $\eta_i$. In this model, the four observations are linearly independent of each other, so the estimated parameter at dose 1 will be $\hat\theta_1 = 0$, which corresponds to $\hat\eta_1 = -\infty$, while dose 4 will have parameter estimate $\hat\theta_4 = 1$, which corresponds to $\hat\eta_4 = \infty$. Estimates for the regression parameters $\beta$ will depend upon the particular parameterization used in the ANOVA model, but regardless of the parameterization used, most, if not all, of the parameter estimates $\hat\beta$ will be infinite and the iterative algorithm will fail to converge. Note, however, that maximum likelihood estimates for the observation probabilities, $\hat\theta_i$, are well defined.

To account for the possibility of infinite $\hat\beta$, CTGLM uses restricted maximum likelihood estimation as follows:

Each observation (censored or otherwise) is restricted such that its estimated probability ($\hat f_i$) is less than 0.9999. Whenever $\hat f_i$ becomes 0.9999 or greater, the observation is omitted from the likelihood (until its probability becomes less than 0.9999).

Note that restricting the $f_i$ also restricts the parameters $\theta_i$ through $\eta_i$. In effect the norm of the $\beta_j$'s is not allowed to become too large. While one could restrict the $\beta_j$'s directly, it is more natural to restrict $f_i$. Moreover, the statistical properties of the resulting estimators are clearer.

It is important to note that it is the probability of the observation, and not the binomial parameter $\theta_i$, which is being restricted. Indeed, in a binomial model a right censored observation may have a current estimate for $\theta$ of 0.7 or less, but because of the censoring, the probability of the observation (i.e., the sum of the binomial probabilities) can be very close to one.
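The point can be illustrated numerically. The sketch below is an assumption-laden illustration, not CTGLM code: it takes two hypothetical observations with $\theta = 0.7$ and $n = 100$, one right censored at 45 and one interval censored on 65-75, and drops from the log-likelihood any observation whose probability reaches the 0.9999 cap.

```python
import math

CAP = 0.9999  # restriction threshold quoted in the text

def interval_probability(n, theta, lo, hi):
    """P(lo <= Y <= hi) for Y ~ Binomial(n, theta)."""
    return sum(math.comb(n, y) * theta**y * (1.0 - theta)**(n - y)
               for y in range(lo, hi + 1))

def reduced_loglik(observations, cap=CAP):
    """Sum log f_i over observations below the cap; report the dropped ones."""
    total, dropped = 0.0, []
    for idx, (n, theta, lo, hi) in enumerate(observations):
        f = interval_probability(n, theta, lo, hi)
        if f >= cap:
            dropped.append(idx)      # restricted: omitted from the likelihood
        else:
            total += math.log(f)
    return total, dropped

# Hypothetical observations: (n, current theta, interval lower, interval upper).
obs = [(100, 0.7, 45, 100),   # right censored; probability is nearly one
       (100, 0.7, 65, 75)]    # interval censored; probability well below one
total, dropped = reduced_loglik(obs)
print(total, dropped)
```

The first observation is dropped even though its binomial parameter is only 0.7, which is the situation the paragraph above describes.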

In the following, the log-likelihood in which observations with probabilities near one have been eliminated is called the ‘reduced likelihood’. The ‘restricted likelihood’ is the log-likelihood one obtains when the restrictions on the $f_i$ are applied. CTGLM optimizes the reduced likelihood, not the restricted likelihood. However, it is easy to show that a local optimum of the reduced likelihood is also a local optimum for the restricted problem. To see this, let $l_i(\beta) = \log(f_i)$, and denote the constraints as $l_i(\beta) < \log(1 - \epsilon) = -\delta$, for $\epsilon, \delta > 0$. Define ‘Lagrange multipliers’ $\mu_i = 1$ if the $i$-th observation is restricted, with $\mu_i = 0$ otherwise. The restricted log-likelihood, $l_R$, involves the $\mu_i$ and is given as

$$l_R = \sum_{i=1}^{N} l_i(\beta) - \sum_{i=1}^{N} \mu_i \left( l_i(\beta) + \delta \right).$$

In the restricted log-likelihood, both $\beta$ and $\mu$ must be estimated. With the choice for the $\mu_i$ above, the restricted log-likelihood, $l_R$, yields the same estimates as the reduced likelihood since the restricted observations are eliminated from $\nabla L$ (see below) in both. All that remains is to show that the $\mu_i$ chosen above yield a local optimum for the restricted likelihood. Using the Kuhn-Tucker second order sufficiency conditions (Luenberger, 1984, pages 316-317), this amounts to showing that the chosen $\mu_i$, $i = 1, \ldots, N$, are such that

$$\mu_i (l_i + \delta) = 0,$$

$$\nabla L = \sum_{i=1}^{N} \nabla l_i(\hat\beta) - \sum_{i=1}^{N} \mu_i \nabla l_i(\hat\beta) = 0,$$

and that the Hessian matrix

$$\nabla^2 L = \sum_{i=1}^{N} \nabla^2 l_i(\hat\beta) - \sum_{i=1}^{N} \mu_i \nabla^2 l_i(\hat\beta)$$

is of full rank on the space orthogonal to the gradients $\nabla l_i(\hat\beta)$ of the restricted observations ($\mu_i = 1$). Because of the form of the Hessian matrix, this latter assumption is equivalent to restricting $\beta$ to a space orthogonal to the rows $x_i$ in $X$ for which $\mu_i = 1$.

Because of the choice for the $\mu_i$'s, and because a local optimum for the reduced likelihood is assumed, the last two conditions above are clearly satisfied. It remains only to show that $\mu_i (l_i + \delta) = 0$ for all $i$. This is trivial to show if $\mu_i = 0$. If $\mu_i = 1$ then $l_i(\hat\beta)$ is restricted and again $l_i + \delta = 0$. Thus, for probabilities near one, the restricted and the reduced likelihoods are identical. (Note, however, that in both the $\hat\beta$ are not uniquely defined because the Hessian is singular.)

As an example of estimates obtained from the algorithm, parameterize the ANOVA model for the example above as follows: Let $\eta_j = \beta_0 + \tau_j$ for $j$ = doses 1, 2, and 3, while $\eta_4 = \beta_0$. Then the estimated parameters are $\hat\beta_0 = 2.8$, and $(\hat\tau_1, \hat\tau_2, \hat\tau_3) = (-10.1, -4.9, -2.8)$. These estimates give estimated observation binomial parameters of (0.0007, 0.1072, 0.5000, 0.9422).

6.0  Some  Considerations  about  the  Deviance 

The deviance $d_i$ (Nelder and Wedderburn, 1972, page 375) for each observation is given as

$$d_i = 2 \left( \max_{\beta} l_i(x_i\beta) - l_i(x_i\hat\beta) \right),$$

i.e., it is twice the difference between the log-likelihood of the observation maximized with respect to the parameter $\eta_i$ ($= x_i\beta$) and the log-likelihood of the observation as obtained from $\hat\beta$. In the absence of restrictions, the deviance has $N - p$ degrees of freedom, where $p$ is the number of parameters in $\beta$. The total deviance is given as the sum of the $d_i$, i.e., as:

$$D = \sum_{i=1}^{N} d_i = \sum_{i=1}^{N} 2 \left( \sup_{\beta} l_i(x_i\beta) - l_i(x_i\hat\beta) \right).$$

The  deviance  of  each  observation  may  be  used  as  a 
‘residual’  (see  below).  The  total  deviance  may  be  used 
in  an  asymptotic  chi-squared  goodness  of  fit  test  of  the 
model  (see  McCullagh  and  Nelder  (1983)),  with  N  -  p 
degrees  of  freedom. 


The definition of the deviance need not change in censored data. The $d_i$ will become, however, more difficult to compute because a closed form solution for the optimising $l_i(x_i\beta)$ is usually not available in censored data.

The degrees of freedom for the deviance when there are restricted observations must also be adjusted. Clearly one degree of freedom should be subtracted from the total degrees of freedom, $N$, for each restriction applied. However, because the restrictions on $l_i$ also restrict the parameters $\beta$, the degrees of freedom, $p$, for the number of estimated parameters may also decrease. Let $r$ denote the number of linearly independent rows in the matrix formed from the restricted $x_i$. Then $r$ is the number of restrictions placed on the parameters $\beta$, and the total degrees of freedom in the deviance is $N - p - q + r$, where $q$ is the number of restrictions placed upon the observations. (Note that in the binomial distribution each Bernoulli trial is an observation and contributes to $N$.)
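The adjustment $N - p - q + r$ can be sketched as follows. This is a minimal illustration with made-up design rows; `matrix_rank` and `deviance_df` are hypothetical helper names, and the rank is found by plain Gaussian elimination rather than by any routine from CTGLM.

```python
def matrix_rank(rows, tol=1e-10):
    """Rank of a small matrix by Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in r] for r in rows]
    if not m:
        return 0
    ncols = len(m[0])
    rank = 0
    for col in range(ncols):
        if rank == len(m):
            break
        pivot = max(range(rank, len(m)), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < tol:
            continue                       # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            factor = m[i][col] / m[rank][col]
            for c in range(col, ncols):
                m[i][c] -= factor * m[rank][c]
        rank += 1
    return rank

def deviance_df(n_obs, n_params, restricted_rows):
    """Degrees of freedom N - p - q + r for the deviance."""
    q = len(restricted_rows)               # restrictions on observations
    r = matrix_rank(restricted_rows)       # restrictions on the parameters
    return n_obs - n_params - q + r

# Hypothetical example: N = 20, p = 4, two restricted observations
# whose design rows are linearly independent (so q = r = 2).
rows = [[1, 1, 0, 0], [1, 0, 0, 0]]
print(deviance_df(20, 4, rows))
```

When two restricted observations share the same design row, $q = 2$ but $r = 1$, so the degrees of freedom drop by one.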

The degrees of freedom in the deviance are data dependent and thus a random variable. More restrictions will be applied in some samples than in others. These changing degrees of freedom will affect the chi-squared goodness of fit test. Whether the restricted degrees of freedom are a better predictor of the adequacy of the asymptotic approximations used throughout the analysis is another question needing study.

7.0  Residual  Plots 

Pregibon (1981) gives two methods for defining residuals which are of interest here. The first, involving the deviances, is given as:

$$r_{1i} = \mathrm{sign}(\tilde\eta_i - x_i\hat\beta) \sqrt{d_i},$$

where $d_i$ is the component of the deviance as discussed above, while $\tilde\eta_i$ is the optimising $\eta_i$ for the single observation in question. Clearly, $r_{1i} = 0$ for restricted observations. In CTGLM, it would be desirable if one could avoid computing the $d_i$ since their computation may be expensive, especially in censored data where the computation of $\tilde\eta_i$ may require iteration for each observation.

The  second  method  of  defining  residuals  is  given  by 
Pregibon  (1981,  pages  708-709)  as: 


Clearly $r_{2i}$ is easy to compute with censored as well as ‘exact’ data. Indeed, $r_{2i}$ is obtainable from quantities already computed in the iterative algorithm. Moreover, it is possible to show that $r_{2i}$ is a ‘one-step’ approximation to $r_{1i}$. Thus, one would expect $r_{1i}$ and $r_{2i}$ to be very close for ‘small’ residuals, and not so close for ‘large’ residuals. This is borne out in the figure below, in which both types of residuals are plotted versus the index number for the data. The data are taken from Pregibon (1981), who attributes them to Finney (1947), and gives similar plots. In this figure, at least, the two types of residuals seem almost equivalent. Because $r_{2i}$ is cheap to compute, while $r_{1i}$ is not, $r_{2i}$ is the residual used in CTGLM. Similar figures were obtained for other data sets.

Figure  1 

The  Residuals  (x-rj,  o-fj) 


8.0  A  Monte  Carlo  Study 

The effect of censoring on the estimated parameters and the resulting residuals is also of interest, as is whether one can compare ‘censored’ residuals with ‘uncensored’ residuals. In an attempt to answer these questions, a small Monte Carlo study was performed using a simple linear logistic regression model. Two factors were studied. The first factor was the censoring level. Three levels of censoring (0%, 10%, and 30%) were used. The second factor was concerned with the appropriateness of the simple linear model. Data generated for level one of this factor fit the model for covariate values of (-2, -0.5, 0.5, and 2.0) corresponding to binomial probabilities of (0.119, 0.378, 0.622, and 0.881). In the second level the same covariate values were used, but the probability 0.622 for the third covariate was changed to 0.20. Thus, in the second level, the simple linear logistic regression model did not fit. A balanced factorial design was used.

All binomial observations involved 10 Bernoulli trials. Censoring was incorporated as follows: For each Bernoulli trial a uniform (0,1) random deviate was generated using function GGUBFS of the IMSL (1984) library. If the generated deviate was greater than the censoring probability then a second deviate was generated. If the second deviate was less than the logistic probability at the given dose, then the observation's count was incremented by one. If the first deviate was less than the censoring probability, then the length of the censoring interval was increased by one. All computations were performed on a Data General MV 10000 computer and coded in FORTRAN.
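The generation scheme can be sketched in a few lines. The Python below is an illustration only: it substitutes Python's `random` module for GGUBFS, and it assumes that the censoring interval runs from the observed count to the observed count plus the number of censored trials, which the text does not state explicitly.

```python
import random

def simulate_censored_binomial(theta, censor_prob, n_trials=10, rng=random):
    """Return (lower, upper) bounds on the number of successes in n_trials.

    Each trial is censored with probability censor_prob; an uncensored
    trial is a success with probability theta.
    """
    successes = 0
    censored = 0
    for _ in range(n_trials):
        if rng.random() > censor_prob:
            # Trial is observed: draw a second deviate for the outcome.
            if rng.random() < theta:
                successes += 1
        else:
            # Trial is censored: the interval grows by one.
            censored += 1
    return successes, successes + censored

rng = random.Random(1986)
sample = [simulate_censored_binomial(0.622, 0.30, rng=rng) for _ in range(5)]
print(sample)
```

With a censoring probability of zero the interval collapses to a single exact count.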

Thirty replications at each combination of the two factors were performed. For each sample of four observations in each replication, the simple linear logistic model was fit involving parameters $\beta_0$ and $\beta_1$ according to the methods discussed above. The results are discussed below.

The expected (based upon the 30 replications) regression slope, $\hat\beta_1$, decreased (p < 0.0001) as the censoring increased. As one would expect, the slope also changed if the model was not correctly specified (p = 0.0963), although the effect was not as dramatic. The intercept ($\hat\beta_0$) was significantly different (p < 0.0001) when the model was not correctly specified. It was not affected (very much) by the censoring level. As the censoring level increased, the size of the average residual decreased, but the decrease was not always significant. To get a feel for the variation in the residuals, consider the figure below. In this figure the average residuals for the ANOVA cells which fit the model are plotted at points (1, 2, 3, and 4) on the x-axis, with all three censoring levels present, while the average residuals for the ANOVA cells which did not fit the model are plotted at x-axis points (5, 6, 7, and 8). The censoring levels are (o-0%, x-10%, +-30%). Note in the figure that the ordering of the residuals tends to be + smallest, x second smallest, and o largest. The residuals at x-axis point 7 do not fit the model (the actual binomial probability was changed from 0.622 to 0.20). This fact is clear from the size of the residuals at this point. Also note that the residuals tend to be smallest when the data fit the model (i.e., at x-axis points 1, 2, 3, and 4).

Figure  2 

The  Expected  Residuals  (o-0%,  x-10%,  +-30%) 


The  residual  plots  seem  to  indicate  that  one  can  com¬ 
pare  'censored'  residuals  with  residuals  from  observations 
which  are  not  censored.  One  should  expect,  however, 
that  residuals  from  censored  observations  will  tend  to  be 
smaller in magnitude.

HOTELLING'S T², ROBUST PRINCIPAL COMPONENTS, AND GRAPHICS FOR SPC


David  Coleman,  RCA  Laboratories 


Abstract 


Correlated variables in the same physical unit are commonly measured for purposes of statistical process control (SPC) - for example, dimensions of a manufactured part. The multivariate control chart using Hotelling's T² statistic is an effective process control technique for this multivariate situation. A common problem, however, has been how to make practical use of such control charts. It is often hard to interpret "out-of-control" declarations so as to produce control actions which correct the problem. Robust principal components analysis captures the covariance structure of subgroups of multivariate observations. Corresponding graphical techniques help us interpret bias and variability problems much more effectively than can a multivariate control chart alone. These methods can be used to develop a diagnostic SPC system.


1. Main Results and Conclusions, In Brief


A  common  SPC  and  engineering  diagnostic  problem  is  to  interpret  and  use  multiple 
measurements  of  a  product  taken  in  the  same  physical  unit,  but  at  different  locations  on 
the  product  (Figure  1).  These  measurements  are  often  highly  correlated.  A  procedure  for 
SPC  is  motivated  and  described  in  this  paper,  using  as  an  example  the  relative  misalign¬ 
ment  of  photolithographic  grids  on  (integrated  circuit)  semiconductor  wafers.  The  pro¬ 
cedure  can  be  summarized  as  follows: 


(1) Use robust principal components to: (a) Select a "process base sample" of a large number of typical multivariate observations, (b) Estimate a "process covariance matrix," S, (c) Compute process principal axes from S.


(2)  Periodically  sample  subgroups  of  product  for  routine  SPC. 


(3) For each subgroup: (a) [Optional] Compute Hotelling's T² statistics for subgroup bias and variability in the principal axis space of the "process" database (Equation (3)). (b) Interpret a potential variance problem using a robust "Principal Axis Plot" (Figures 7a and 7b) for the subgroup. (c) Interpret a potential bias problem by using both a "Spider Plot" (Figure 6), which shows an exaggerated representation of measurements taken on subgroups of product, and plots of the subgroup principal components (Figure 10).


2.  Introduction 


2.1  Statement  of  the  Problem 


A  common  problem  in  manufacturing  is  that  of  maintaining  a  process  under  statistical 
control  when  it  has  several  correlated  process  or  performance  control  variables,  in  the 
same  physical  units.  The  prototypical  case  for  this  paper  will  be  the  physical  dimensions 
of  a  manufactured  part.  For  example,  suppose  two  oxide  grids  are  supposed  to  be  applied 
to  a  semiconductor  wafer  (perhaps  a  "monitor  wafer")  such  that  one  falls  exactly  upon 
the other, but manufacturing variability causes misalignment. Figure 1 shows a simple but extreme case of misalignment. Measurements for the purposes of SPC and diagnosis of manufacturing engineering problems might be taken on sample wafers at a small subset of the grid nodes, such as the 5-by-5 array shown in Figure 1. These measurements might be of misalignment or of some other variable, such as thickness or sheet resistivity. Commercial systems are currently available to take and display various measurements at many locations on a wafer - in the form of a contour plot. As far as the author can determine, however, none of them has the statistical capabilities described in this paper.

In  one  of  the  simplest  statistical  models  of  misalignment  -  the  physical  displacement  of 
grid  B  relative  to  grid  A  -  a  simple  analytic  expression  would  allow  us  to  express  the 
transformations  associated  with  Euclidean  geometry:  rotation  about  some  point  (perhaps 
not  the  grid  center,  but  constrained  to  be  within  a  limited  domain),  then  horizontal  and 
vertical  translation: 

$$x_{i,j,k} = x^{rot} + r_{i,j,k} \cos(\theta_i + \phi_{i,j,k}) + x_i^{trans} + \epsilon^{x}_{i,j,k}$$

$$y_{i,j,k} = y^{rot} + r_{i,j,k} \sin(\theta_i + \phi_{i,j,k}) + y_i^{trans} + \epsilon^{y}_{i,j,k} \qquad (1)$$

$i = 1, \ldots, n$ refers to the wafer number within a sample,

$j = 1, \ldots, 5$ (the horizontal position on the measurement array),

$k = 1, \ldots, 5$ (the vertical position on the measurement array),

$r_{i,j,k} = \sqrt{(x^{actual}_{i,j,k} - x^{rot})^2 + (y^{actual}_{i,j,k} - y^{rot})^2}$ = the Euclidean distance of each of the 25 locations of the measurement array from the point of rotation, $(x^{rot}, y^{rot})$,

$\theta_i$ = the angle of rotation about the point of rotation,

$\phi_{i,j,k}$ = e.g., $\sin^{-1}\left( \frac{y^{actual}_{i,j,k} - y^{rot}}{r_{i,j,k}} \right)$ = the angle position of the 25 locations of the measurement array with respect to the point of rotation ($r_{i,j,k} \neq 0$),

$x_i^{trans}$, $y_i^{trans}$ are the amounts of translation in $x$ and $y$,

$\epsilon^{x}_{i,j,k}$ and $\epsilon^{y}_{i,j,k}$ are measurement errors in the $x$ and $y$ directions.

[Note that the constraint on the location of the point of rotation forces us to also allow a translation: any (Euclidean) isometry can be described by a rotation about a point, if the location of such a point is unconstrained.]

We  might  assume  a  model  such  as  this  and  use  a  procedure  such  as  nonlinear  least  squares 
to  estimate  the  unknown  parameters  of  the  transformations. 
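The deterministic part of model (1) can be sketched as follows. This Python fragment is an illustration under assumptions: the measurement array is taken to be the integer grid 1..5 by 1..5, measurement errors are omitted, and `atan2` replaces the inverse sine so that the angle position lands in the correct quadrant. The function name and the parameter values are hypothetical.

```python
import math

def misaligned_grid(nominal, theta, rot_point, translation):
    """Rotate nominal (x, y) locations by theta about rot_point, then translate."""
    x_rot, y_rot = rot_point
    x_tr, y_tr = translation
    moved = []
    for x, y in nominal:
        r = math.hypot(x - x_rot, y - y_rot)      # distance from point of rotation
        phi = math.atan2(y - y_rot, x - x_rot)    # angle position before rotation
        moved.append((x_rot + r * math.cos(theta + phi) + x_tr,
                      y_rot + r * math.sin(theta + phi) + y_tr))
    return moved

# 5-by-5 measurement array (j = horizontal, k = vertical position).
nominal = [(float(j), float(k)) for j in range(1, 6) for k in range(1, 6)]

# Hypothetical misalignment: 2 degrees about the array centre, small shift.
moved = misaligned_grid(nominal, math.radians(2.0), (3.0, 3.0), (0.1, -0.05))
for (x0, y0), (x1, y1) in list(zip(nominal, moved))[:3]:
    print((x0, y0), "->", (round(x1, 4), round(y1, 4)))
```

Fitting the model by nonlinear least squares would then amount to choosing the angle, rotation point, and translation that minimize the squared distances between such predicted locations and the measured ones.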

Unfortunately, model (1) is too simple for many applications, because rotation and physical translation may be inadequate to describe the possible patterns of measurements, such as grid misalignment. We may want to allow additional transformations: shearing, projections, inversions in circles, reflections in lines, and other one-to-one, differentiable mappings. See Figure 2 (as suggested in [1]). In our specific wafer example, if the grids are applied by projection photo-lithography, there may be additional grid distortions due to process or wafer irregularities, such as vertical or horizontal stretch (or compression - "negative stretch"), diagonal stretch, radial stretch, saddle-shaped wafer, or local distortion (e.g., a local blemish). Transformations as diverse as this can occur in measurements of all kinds on processes of all kinds; they are not peculiar to wafer fabrication.

The direct mathematical modeling approach is always appealing. We could, in principle, extend model (1) to include various additional types of transformations. However, we may not be able to state, a priori, in what geometry we should be working - more specifically, what all of the plausible patterns of distortion might be. Also, the patterns may be too complex or diverse for us to be confident that we can wisely spend degrees of freedom for estimation of linear and non-linear distortion parameters. In addition, we are likely to be interested in patterns of systematic VARIATION in the measurements, as well as patterns of BIAS (measurements made on individual samples of product). These variational patterns may be even more difficult to specify, a priori. Hence, the potential complexity of the direct mathematical modeling approach leads us to consider an indirect approach. This paper describes such an indirect approach: a diagnostic SPC system for wafer misalignment or some other similar performance variable.


2.2 Hotelling's T² Statistic(s)

A standard textbook strategy for SPC when there are correlated process variables (see, for example, [2]) is to use Hotelling's T² statistic,

$$T^2 = (x_j - \bar{x})' S^{-1} (x_j - \bar{x}) = y_j' y_j \qquad (2)$$

where $x_j$ is the $j$th $p$-dimensional observation which we want to assess for control, $\bar{x}$ is the $p$-dimensional vector of sample means of the $p$ (correlated) process variables taken on $n$ parts (wafers, in our example), $S$ is a sample covariance matrix, and $y_j = V' D^{-1} (x_j - \bar{x})$ is the $j$th observation in principal axis space ($V$ is from the eigen-decomposition of $S$). The statistic is quadratic in the vector $x$ (see Figure 3).


2.2.1 Use of Hotelling's T²


Hotelling's T² statistic correctly handles the correlation structure of the $p$ process variables in that it gives true $\alpha$-level (Type I) confidence in declaring "out-of-control" for simultaneous values of the $p$ variables. It would seem to be the method of choice for the multivariate SPC situation. As J.S. Hunter has repeatedly commented (e.g., [3]):

"if bivariate charts are so valuable, why, one might ask, haven't such charts found wider use? Arithmetic is the answer. Hotelling's T² statistic must be calculated to establish the bivariate [and more generally, multivariate] control boundaries. This expression and its associated arithmetic may appear formidable, but they are not. Today's hand-held calculator or desk-top computer is easily programmed to complete the necessary arithmetic and graphics within a few seconds.... In practice, the factory worker would place the several measured responses into the hand-held or desk-top calculator. The calculator would calculate T², and could be programmed to 'beep' whenever an unusual value of T² was obtained... Monitoring today's processes with one-variable-at-a-time methods is to throw away information."
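Hunter's point that the arithmetic is easily programmed can be illustrated for the bivariate case. The Python sketch below computes T² both directly from the inverse covariance matrix and as a sum of squared, scaled principal-axis scores. It assumes the decomposition $S = V \Lambda V'$ with scores scaled by $\Lambda^{-1/2}$; the data and function names are made up for illustration.

```python
import math

def mean_and_cov(data):
    """Sample mean and covariance (sxx, sxy, syy) of bivariate data."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def t2_direct(x, mean, cov):
    """T^2 = (x - mean)' S^{-1} (x - mean) using the 2x2 inverse."""
    sxx, sxy, syy = cov
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def t2_principal_axes(x, mean, cov):
    """T^2 as the sum of squared scores along the principal axes of S."""
    sxx, sxy, syy = cov
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    if sxy == 0.0:
        return dx * dx / sxx + dy * dy / syy   # axes already principal
    half_trace = (sxx + syy) / 2.0
    root = math.hypot((sxx - syy) / 2.0, sxy)
    t2 = 0.0
    for lam in (half_trace + root, half_trace - root):
        v = (sxy, lam - sxx)                   # eigenvector of S for lam
        norm = math.hypot(v[0], v[1])
        score = (v[0] * dx + v[1] * dy) / norm / math.sqrt(lam)
        t2 += score * score
    return t2

# Made-up base sample and a new observation to assess.
base = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 4.0)]
mean, cov = mean_and_cov(base)
new_point = (2.0, 5.0)
print(t2_direct(new_point, mean, cov), t2_principal_axes(new_point, mean, cov))
```

The two numbers agree up to rounding, which is the identity expressed by equation (2).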


It is hard to disagree with Hunter. The many companies now supplying SPC software have perhaps not yet attained this level of sophistication, but it is only a matter of time. The real impediment to multivariate SPC, other than computational complexity, has been the difficulty of interpretation. However, for the SPC situations described in this paper, the Hotelling's T² approach can be usefully augmented with robust methods and with statistical graphics so as to ease interpretation, especially for manufacturing engineers. Some motivation is given in the next section.

2.2.2  Limitations,  Especially  When  p  is  Large 

While technically correct, and usable in practice, Hotelling's T² suffers from problems of interpretation, particularly when $p$ is large.

Consider  our  wafer  example,  when  the  causes  of  misalignment  might  be  complex. 
Succbut  it  might  be  hard  to  interpret  the  out-of-control  declarations,  and  to  decide  what 
to  do  to  get  the  process  back  in  control.  Indeed,  it  has  been  the  author’s  experience  that 
the  same  reaction  is  obtained  again  and  again,  after  describing  how  the  low  Type  1  error 
(false  reject)  rate  of  control  chart  mein  the  declaration  that  SOMETHING  is  wrong  with 
the  process:  "All  right.  WHAT  is  wrong  with  the  process:  what  do  I  fix?" 

This is not an easy question to answer. Indeed, it may seem to be better answered by an engineer who is more familiar with the process. However, more information CAN be gleaned from a control chart than just a control declaration. For the simplest case, the univariate control chart, a collection of rules of thumb can be usefully developed. These can be based upon the standard (or any specialized) conditions for declaring "out-of-control." For example, such rules for the $\bar{X}$ chart can take the form: (a) one point more than $3\sigma$ from the mean indicates a sudden, extreme departure from control, or (b) eight points in a row on the same side of the mean indicates a trend or a slight but persistent shift. Once users become familiar and comfortable with these rules (and the construction and philosophy of the control charts), more can be learned. The multiple rules serve to narrow the field of possible problems in that it is more likely that (a) is a bolt that snapped and (b) is tool wear, than vice versa. Of course, engineering interpretation is always needed to identify the specific problem at hand.
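The two example rules can be sketched as a small checker. This Python function is illustrative only; the function name, the signal labels, and the choice to flag a run once, at the point where it first reaches eight, are assumptions rather than anything prescribed in the text.

```python
def control_chart_signals(values, mean, sigma, run_length=8):
    """Flag (a) points beyond 3 sigma and (b) runs on one side of the mean.

    Returns a list of (index, description) pairs.
    """
    signals = []
    run_side, run_count = 0, 0
    for i, v in enumerate(values):
        # Rule (a): a single point more than 3 sigma from the mean.
        if abs(v - mean) > 3.0 * sigma:
            signals.append((i, "beyond 3 sigma"))
        # Rule (b): run_length consecutive points on the same side of the mean.
        side = (v > mean) - (v < mean)         # +1 above, -1 below, 0 on the mean
        if side != 0 and side == run_side:
            run_count += 1
        else:
            run_side, run_count = side, (1 if side != 0 else 0)
        if run_count == run_length:            # flag when the run first reaches length
            signals.append((i, "run of %d on one side" % run_length))
    return signals

# Made-up subgroup means: one extreme point, then a persistent small shift.
xbar = [0.2, -0.4, 3.5, -0.1, 0.3, 0.2, 0.5, 0.1, 0.4, 0.2, 0.3, 0.6]
print(control_chart_signals(xbar, mean=0.0, sigma=1.0))
```

In this made-up series the first signal corresponds to rule (a) and the second to rule (b), the kind of distinction the text associates with a snapped bolt versus tool wear.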

Similar rules of thumb can be developed for Hotelling's T² control charts - but the difficulties of interpretation are compounded. In the bivariate case, the simplest condition is one extreme point beyond the 99+% level. It can be due to extreme values of both variables, or just one of the two variables, or a pair of values of the two variables that is unusual - though neither value may be extreme in itself. In Figure 3, point A is likely to be easy to interpret, but points B and C, and especially D, are likely to be difficult to interpret. The difficulties of interpretation are far greater when we consider more than two, say 50, variables, as in the wafer grid misalignment measurements. With a problem of this size and potential complexity, Hotelling's T² is not informative enough. We need effective interpretation, not just correct $\alpha$-level declaration of "out-of-control."

3.  Principal  Components  Analysis  on  Successive  Rational  Subgroups 

Jackson advocates the use of principal components and a form of Hotelling's $T^2$
statistic for rational subgroups ([4], [5]). Instead of the simple $T^2$ statistic for a single part,
given above, we compute (following Jackson) for $n$ observations and $k \le \min(n, p)$
principal components (ideally, $n \gg p$):

$$T^2_{O,i} = T^2_{M,i} + T^2_{D,i} \qquad (3)$$

$T^2_{O,i} = \sum_{j=1}^{n} y_j' y_j$ is the sum of the $T^2$ values of the observations in subgroup $i$ about the
grand mean, $\bar{\bar{x}}$ (note that we replace the mean in (2) with the grand mean). It has an
asymptotic $\chi^2_{kn}$ distribution under the null hypothesis that the observations in this
subgroup are not significantly different from the grand mean (usually we do hypothesis
testing with $T^2_{M,i}$ and $T^2_{D,i}$ instead of with $T^2_{O,i}$).

$T^2_{M,i} = n\,\bar{y}_i'\,\bar{y}_i$ (where $\bar{y}_i = \sum_{j=1}^{n} y_j / n$ is the principal-component transform of
$\bar{x}_i - \bar{\bar{x}}$) is the squared bias of the
subgroup mean from the grand mean. It has an asymptotic $\chi^2_k$ distribution under the null
hypothesis that the bias is not significantly different from zero.

$T^2_{D,i}$, computed by $T^2_{D,i} = T^2_{O,i} - T^2_{M,i}$, is the variability of the subgroup observations about
their own mean, $\bar{x}_i$. It has an asymptotic $\chi^2_{k(n-1)}$ distribution under the null hypothesis
that the subgroup variability is not significantly different from the variability from which
the principal components were derived.
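
A small numerical check can make the decomposition in (3) concrete. The sketch below assumes the score vectors $y_j$ have already been computed (scaled principal-component scores of the deviations from the grand mean); the function name `t2_decomposition` is hypothetical.

```python
def t2_decomposition(scores):
    """Split the overall T^2 of one subgroup into bias and variability parts.

    scores: list of n score vectors y_j, each of length k, assumed to be
            scaled principal-component scores about the grand mean.
    Returns (t2_overall, t2_mean, t2_dev).
    """
    n = len(scores)
    k = len(scores[0])
    y_bar = [sum(y[a] for y in scores) / n for a in range(k)]

    t2_overall = sum(v * v for y in scores for v in y)    # sum of y_j'y_j
    t2_mean = n * sum(m * m for m in y_bar)               # n * ybar'ybar
    t2_dev = sum((y[a] - y_bar[a]) ** 2                   # about own mean
                 for y in scores for a in range(k))
    return t2_overall, t2_mean, t2_dev
```

For two score vectors (1, 0) and (3, 2), the three values are 14, 10, and 4, so the overall statistic is the sum of the bias part and the variability part.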

Paired control charts of $T^2_{M,i}$ and $T^2_{D,i}$ comprise a multivariate analog of $\bar{X}$ and $R$ (or
$s$) charts, as briefly mentioned by Jackson. A perusal of textbooks and current literature
indicates that these paired charts are rarely used. When computed robustly, and
supplemented with good graphical displays, paired $T^2$ charts can be very effective.

3.1  Computations 

A classic issue in principal components analysis is whether to do an
eigen-decomposition of the covariance matrix, $S$, or the correlation matrix, $R$. For the case of
interest in this article - correlated variables in the same physical unit - the use of $S$ is
appropriate, since we have no a priori reason to scale measurements at one location
differently than at another location.

Using $S$ rather than $R$ also allows us to make simple quantitative statements of
interest about the principal component analysis. For example, suppose that we take
the total variance in the original $n$ by $p$ matrix of observations, $X$, to be trace($S$). Since the
matrix of eigenvectors, $V$, is orthonormal, the total variance of $X$ remains the same after
pre-multiplication ($V'X$ corresponds to a rigid rotation of the basis vectors). Hence, we
can make statements of the form, "$m$% of the total variance in $X$ is accounted for by the
first $k$ principal axes, which are: ..." (as stated in [6]).
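
As an illustration of such a statement, the sketch below computes the share of trace($S$) carried by the first principal axis in the simplest possible setting, two variables, where the eigenvalues of $S$ have a closed form. The function names and the restriction to $p = 2$ are assumptions made to keep the example self-contained.

```python
import math


def covariance_2d(rows):
    """Sample covariance (s00, s01, s11) of a list of (x, y) observations."""
    n = len(rows)
    m0 = sum(r[0] for r in rows) / n
    m1 = sum(r[1] for r in rows) / n
    s00 = sum((r[0] - m0) ** 2 for r in rows) / (n - 1)
    s11 = sum((r[1] - m1) ** 2 for r in rows) / (n - 1)
    s01 = sum((r[0] - m0) * (r[1] - m1) for r in rows) / (n - 1)
    return s00, s01, s11


def eigenvalues_sym_2x2(s00, s01, s11):
    """Eigenvalues (largest first) of the symmetric matrix [[s00, s01], [s01, s11]]."""
    half_trace = (s00 + s11) / 2.0
    radius = math.hypot((s00 - s11) / 2.0, s01)
    return half_trace + radius, half_trace - radius


def percent_explained_by_first_axis(rows):
    """Percent of trace(S) accounted for by the first principal axis."""
    s00, s01, s11 = covariance_2d(rows)
    largest, _ = eigenvalues_sym_2x2(s00, s01, s11)
    return 100.0 * largest / (s00 + s11)
```

Because the eigenvalues sum to trace($S$), the percentages over all axes add to 100, which is what makes the "$m$% of the total variance" phrasing meaningful.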

3.2  Graphical  Displays 

As  stated,  the  wafer  example  is  prototypical  for  this  paper.  However,  the  graphical 
tools  described  below  are  applicable  to  any  situation  where  dimensions  or  some  other 
process-related or performance-related variable of a part are measured in the same
physical unit, at different locations. Two graphical tools are presented for interpreting bias
problems  in  misalignment.  One  of  the  tools  is  the  Exaggerated-Measurement  Plot  -  which 
is  similar  to  a  contour  plot.  For  subgroups  of  wafers  rather  than  individual  wafers,  we  can 
use  the  related  Spider  Plot.  Another  graphical  tool  presented  below  is  the  Principal  Axis 
Plot.  It  aids  in  interpreting  variability. 

3.2.1  Exaggerated-Measurement  Plots  and  Spider  Plots 

The Exaggerated-Measurement Plot is a representation of the part on which
measurements have been made - but with exaggerated representations of the measurements
displayed  (see  Figures  4a  and  4b). 

Specifically, an Exaggerated-Measurement Plot is constructed by drawing a line
segment for each location of measurement. One end of the line segment is at the nominal
location of measurement. From that point, the line segment is drawn in the same direction
as  the  misalignment  at  that  location,  but  the  length  of  the  segment  is  scaled  (up)  so  as  to 
make  it  easier  to  identify  patterns  of  misalignment,  perhaps  such  as  shown  in  Figure  1. 
Thus,  the  axes  on  the  plot  implicitly  carry  two  scales:  the  scale  of  the  part  being  measured 
-  in  which  we  can  find  the  locations  of  measurement,  and  the  misalignment  error  scale  - 
which  might  be  several  orders  of  magnitude  less. 
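
The construction just described amounts to computing one line segment per measurement location. A minimal sketch follows; the function name `exaggerated_segments` and the default factor of 100 are illustrative assumptions, and the actual drawing is left to whatever plotting tool is at hand.

```python
def exaggerated_segments(nominal, errors, factor=100.0):
    """Line segments for an Exaggerated-Measurement Plot.

    nominal: list of (x, y) nominal measurement locations (part scale).
    errors:  list of (dx, dy) measured misalignments (error scale).
    factor:  common exaggeration applied to every error.
    Returns a list of ((x0, y0), (x1, y1)) segments, one per location.
    """
    return [((x, y), (x + factor * dx, y + factor * dy))
            for (x, y), (dx, dy) in zip(nominal, errors)]
```

Each segment starts at the nominal location and points in the direction of the misalignment, so the two scales mentioned above share one set of axes.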

The  Exaggerated-Measurement  Plot  is  analogous  to  a  caricature,  in  that  enough  of  the 
nominal features of the part are shown so that the part can be recognized for what it is,
but the measured features unique to that particular part are coded into the graphical
representation so as to highlight those features. This makes it easier to discriminate among
parts. We may especially want to tell good parts from bad parts. Figures 4a and 4b show
Exaggerated-Measurement  Plots  for  misalignment  of  grids  on  two  wafers.  The  wafer  scale 
might  be  10-4  meters,  while  the  misalignment  scale  might  be  KT*  meters. 

By  overlaying  Exaggerated-Measurement  Plots  from  sample  members  of  a  subgroup 
(see  Figure  5)  we  can  see  the  bias  and  variability  of  subgroups  on  a  per-location  basis. 
However,  this  method  of  graphical  presentation  can  be  improved. 

R. Barton ([7]) and (implicitly) J. E. Jackson ([4]) have used a superior means of
graphical presentation for subgroups of data of this form - the Spider Plot. The Spider
Plot helps to separate bias from variability. Jackson illustrated it as a means for
geometrically representing the relationships between the $T^2$ statistics of (3). For applications such
as our wafer example, it can be used directly at each measurement location, as seen in
Figure 6. It is constructed as follows: (a) Exaggerate all misalignment errors by the same
amount, e.g., by a factor of 100. Do the following for each measurement location: (b)
Compute the subgroup mean exaggerated error, (c) Draw a line segment from the nominal
measurement location to the subgroup mean, (d) Draw one line segment per wafer from
the subgroup mean to the exaggerated error position - as would be obtained in an
Exaggerated-Measurement Plot representation for that wafer at that location.
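
Steps (a) through (d) can be sketched for a single measurement location as follows. The function name `spider_segments` and its return layout are assumptions for illustration only.

```python
def spider_segments(nominal, wafer_errors, factor=100.0):
    """Segments of one 'spider' at one measurement location.

    nominal:      (x, y) nominal measurement location.
    wafer_errors: list of (dx, dy), one misalignment per wafer in the subgroup.
    factor:       common exaggeration, step (a).
    Returns (bias_segment, legs): the segment from the nominal location to
    the subgroup mean, and one segment per wafer from the mean outward.
    """
    x, y = nominal
    n = len(wafer_errors)
    # Step (b): subgroup mean of the exaggerated errors.
    mean_dx = factor * sum(e[0] for e in wafer_errors) / n
    mean_dy = factor * sum(e[1] for e in wafer_errors) / n
    centre = (x + mean_dx, y + mean_dy)
    # Step (c): bias segment from nominal location to subgroup mean.
    bias_segment = ((x, y), centre)
    # Step (d): one leg per wafer, from the mean to that wafer's position.
    legs = [(centre, (x + factor * dx, y + factor * dy))
            for dx, dy in wafer_errors]
    return bias_segment, legs
```

The bias segment carries the subgroup mean and the legs carry the within-subgroup variability, which is the separation the Spider Plot is meant to show.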

Another  way  to  conceptualize  how  a  Spider  Plot  is  constructed  is  to  suppose,  for  a 
moment,  that  the  subgroup  mean  is  zero  at  all  measurement  locations.  Then,  the  Spider 
Plot  would  be  identical  to  overlaid  Exaggerated-Measurement  Plots.  Each  spider  could 
then  be  displaced  from  the  nominal  position  to  the  actual  subgroup  mean,  per  location. 
This  per-location  subgroup  bias  is  also  represented  by  a  line  segment  -  with  a  solid  dot  at 
each  end. 

The Spider Plot helps us to make qualitative assessments of the form, "Is the
within-subgroup variability sufficient to disregard the bias from the nominal?" and "Is there a
pattern in the bias which is independent of the magnitude or shape of the variability?" and
"Are all the spiders of the 'same species?'"

Either of the above plots is the natural companion to either the $T^2$ (for individual
parts) or the $T^2_{M,i}$ (for subgroups of parts) statistic. They each show departure from
nominal which is in the form of bias. Exaggerated-Measurement Plots or Spider Plots could be
examined routinely, or a control procedure could be designed so that when $T^2$ or $T^2_{M,i}$
(whichever was being used) was declared out of control, an Exaggerated-Measurement Plot or
Spider Plot could be produced (by computer, of course) to help diagnose the problem. See
Section 5.1 for such a procedure.

3.2.2  Principal  Axis  Plots  ("Major  Motion  Pictures") 

Principal Axis Plots address the systematic-variability side of the control question. A
limitation of the Exaggerated-Measurement and the Spider Plots is that though they
highlight part-to-part or subgroup-to-subgroup differences, it is hard to see PATTERNS of
differences between parts or subgroups. This limitation holds also for the classical
statistics we might compute on the $X$ matrix, $\bar{X}$ and $S$: $\bar{X}$ is computed per measurement
location, and $S$ shows only pairwise relationships. Though we can compare bias and
variability, location-to-location, in Figures 5 and 6, we cannot readily perceive whether or not
the variation is systematic. In our assumed multivariate SPC situation, we know there is
high correlation - so much of the variation IS systematic. A principal components analysis
gives how much variation is systematic, and a Principal Axis Plot shows the pattern of
systematic variation.

The Principal Axis Plot, such as shown in Figures 7a and 7b, is constructed as follows:
(a) Do principal components analysis of the subgroup, resulting in (usually) a few major
principal axes (that is, axes associated with relatively large principal values), (b) Scale
each principal axis to be displayed (perhaps by a fixed value, or proportional to the
associated principal value), (c) For each principal axis to be represented, place a line
segment at each measurement location with one end positioned at the nominal measurement
location, (d) Draw the line
segment as in an Exaggerated-Measurement Plot - treating the scaled principal axis as a
vector of misalignment errors, with a horizontal and a vertical component for each
measurement location, (e) Put "arrowheads" on the ends of the line segments, to help
distinguish the Principal Axis Plot from an Exaggerated-Measurement Plot in appearance.
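
A rough sketch of steps (a) through (d) for the first principal axis is given below. It assumes each wafer is stored as a flat list (dx1, dy1, dx2, dy2, ...) over the measurement locations, and it uses plain power iteration to find the leading eigenvector of the sample covariance matrix; that numerical method, the function names, and the data layout are all choices made for this illustration, not the computation used in the paper.

```python
import math


def covariance(rows):
    """Sample covariance matrix (list of lists) of equal-length observation rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[a] for r in rows) / n for a in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def first_principal_axis(S, iterations=200):
    """Leading eigenvalue and unit eigenvector of S by power iteration."""
    p = len(S)
    v = [1.0 + 0.01 * a for a in range(p)]  # arbitrary non-zero start
    for _ in range(iterations):
        w = [sum(S[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("covariance matrix has no variation to analyse")
        v = [x / norm for x in w]
    Sv = [sum(S[a][b] * v[b] for b in range(p)) for a in range(p)]
    eigenvalue = sum(v[a] * Sv[a] for a in range(p))
    return eigenvalue, v


def principal_axis_arrows(nominal, axis, scale=1.0):
    """Arrow segments: the scaled axis treated as a vector of (dx, dy) errors."""
    return [((x, y), (x + scale * axis[2 * l], y + scale * axis[2 * l + 1]))
            for l, (x, y) in enumerate(nominal)]
```

If every wafer in a subgroup differs from the others only by a common diagonal shift, the first axis has equal horizontal and vertical components at every location, and the arrows all point along the same diagonal.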

Figure 7a is a Principal Axis Plot for the first principal axis for a subgroup of wafers.
It is apparent that much (nearly 88%) of the variation in misalignment is along a 30°-45°
(diagonal) direction. The second principal axis is shown in Figure 7b, and it appears to be
compression along one Cartesian axis, and expansion along the other, with an origin near
the array point (row 2, column 4). The orthogonality of principal axes lessens the
likelihood that patterns of systematic variability will be confounded in either of two ways:
two sources of variability captured in one major principal axis, or one source of variability
split into two major principal axes. As with Exaggerated-Measurement Plots and Spider
Plots, we search for patterns such as illustrated in Figure 2 when we examine Principal
Axis Plots.

The Principal Axis Plot is the natural companion to the $T^2_{D,i}$ statistic, which is just the
sum of the $T^2$ statistics of (2) if there is only one subgroup (in which case the grand mean
is identical to the mean). Principal Axis Plots could be examined routinely, or a control
procedure could be designed so that when $T^2_{D,i}$ was declared out of control, a Principal
Axis Plot could be produced of the offending subgroup, as given in Section 5.1.

3.3  Interpretation 

In Section 2.2.2, the motivation for providing a means of interpretation of $T^2$ statistics
was given. For SPC in the multivariate situation discussed in this paper, principal
components analysis plus the three graphical techniques described in Section 3.2 help with
engineering interpretation. This section is brief, because general discussion of
interpretation is necessarily limited in scope. Section 5.2 has a more detailed example.

3.3.1  Classification 

A good way to interpret correlated process variables in the same physical unit is to
look for patterns among the values of the variables - particularly when those values have
been declared out of control by a multivariate control procedure, such as Hotelling's $T^2$.
This is especially effective for correlated process-related or performance-related variables
measured at different locations of a manufactured part. For what types of patterns should
we look? Probably the same types whether we are examining bias (using a Spider Plot or
Exaggerated-Measurement Plot) or systematic variability (using a Principal Axis Plot) -
though the interpretations would naturally differ. Typically, we should look for: (1)
"global" patterns associated with the types of one-to-one, differentiable transformations
listed in Section 2, such as: translation, rotation (not necessarily about the center), stretch
in one dimension (not necessarily the original measurement dimensions), radial stretch
(not necessarily centered), corner effects (such as "sag"), edge effects (such as a "rim"), and
(2) "local" phenomena, such as isolated outliers or small clusters of outliers. When a
process is declared out of control in subgroup variability, the type of thing that we do NOT
want to see in a Principal Axis Plot is seen in Figure 8. Such a pattern of systematic
variability would be difficult to interpret - especially if it were associated with a large
principal value (that is, it "accounted for" a lot of variability in the raw measurement
data). If we were to get an Exaggerated-Measurement Plot with line segments such as
shown in Figure 8, it would imply local problems rather than global problems.

3.3.2  Manufacturing  Diagnostic  Tables 

The pragmatic advantage to classifying patterns is the potential for developing rules of
thumb for engineering action. These might be analogous to (but more complex than) those
developed for $\bar{X}$ and $R$ charts. Two basic methods might be used for tracing back
performance patterns of bias or variability to their root causes: (1) physical phenomena affected
by the geometry, and (2) pattern preservation.

Physical  phenomena  affected  by  the  geometry  include  such  processes  as: 

•  film  deposition  -  which  might  depend  on  distance  from  the  source  of  deposition 
material. 

•  thermal,  chemical,  or  electrical  processing  -  which  might  be  more  extensive  along  the 
edges  and/or  corners  because  they  are  more  exposed. 

•  imaging of light or electrons - which might have distortion approximately
proportional to the angle of refraction or deflection.

Pattern preservation is a conservation law, which states that if a certain type of
deviation from nominal (e.g., a lateral shift) is introduced during manufacturing, it will be
preserved in the performance variables of the final product, unless specifically taken out.
When process steps are more or less independent of one another, common sense tells us
that patterns of deviations will be preserved. But even though this is often the case,
pattern preservation is not often exploited in multivariate SPC. Examples are: lateral shift of
a part relative to a fixture, rotation of part relative to fixture, optical process errors in
setup leading to stretch or shrinkage, or other one-to-one, differentiable transformations of
many kinds.

Departures from the assumption that the process steps are independent can take two
basic forms. Further process steps can reduce patterns, or can magnify patterns.
Reduction can occur when processes do relative tolerancing or relative alignment, rather than
absolute. Or, a later process step may remove the deviation from nominal altogether (e.g.,
incorrect seating in a clamp might be deliberately corrected, or inadvertently corrected by
handling). Magnification can occur when a small deviation from nominal can propagate, or
have effects which propagate (e.g., a small burr in a milling process can cause extensive
milling irregularities, or an off-trajectory laser beam may get further and further
off-trajectory). After-the-fact SPC techniques are most likely to help identify the cause when
independence or magnification holds.

4.  Robust  Principal  Components  Analysis 

To guard against distortions due to atypical data values, Devlin, Gnanadesikan, and
Kettenring ([8]) suggest several methods for computing robust principal components.
Additionally, these methods can be used to identify such atypical values. They may be
the most informative data.

4.1  Alternatives 

Devlin et al. recommend the use of three robust techniques for principal component
analysis. One is iterative multivariate trimming (MVT) based on trimming a fixed
proportion, $\alpha$, of observations with extreme (Mahalanobis) distances,
$d_j^2 = (x_j - m^*)' (S^*)^{-1} (x_j - m^*)$, where $m^*$ is the current measure of location, and $S^*$ is the
current robust estimate of $S$ ($m$ and $S$ are used as the starting points for $m^*$ and $S^*$).
Other recommended approaches are based on "maximum likelihood $t$" (MLT), for only the
Cauchy case (1 degree of freedom), and a Huber-weights-based method (HUB) designed to
make $S^*$ asymptotically unbiased for the multivariate normal. MLT and HUB were judged
superior to MVT with regard to estimation of correlations, eigenvalues, and eigenvectors,
but MVT converges faster, and is recommended by the authors for large values of $p$ -
which is typical of the SPC situation considered in this paper.
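
The trimming idea can be sketched for two variables, where the inverse of the covariance matrix can be written out by hand. This is a simplified illustration and not the algorithm of Devlin et al.: the stopping rule (stop when the retained set no longer changes), the function names, and the restriction to $p = 2$ are assumptions made here.

```python
def mean_and_cov(rows):
    """Sample mean (m0, m1) and covariance (s00, s01, s11) of 2-D observations."""
    n = len(rows)
    m0 = sum(r[0] for r in rows) / n
    m1 = sum(r[1] for r in rows) / n
    s00 = sum((r[0] - m0) ** 2 for r in rows) / (n - 1)
    s11 = sum((r[1] - m1) ** 2 for r in rows) / (n - 1)
    s01 = sum((r[0] - m0) * (r[1] - m1) for r in rows) / (n - 1)
    return (m0, m1), (s00, s01, s11)


def squared_distance(x, m, cov):
    """Squared Mahalanobis distance of x from m under a 2x2 covariance."""
    s00, s01, s11 = cov
    det = s00 * s11 - s01 * s01
    dx, dy = x[0] - m[0], x[1] - m[1]
    return (s11 * dx * dx - 2.0 * s01 * dx * dy + s00 * dy * dy) / det


def trimmed_estimates(rows, alpha=0.1, max_iter=25):
    """Iteratively trim the fraction alpha of most distant observations.

    Returns (location, covariance, rejected_indices).
    """
    n_keep = len(rows) - int(round(alpha * len(rows)))
    kept = list(range(len(rows)))           # start from the classical m and S
    for _ in range(max_iter):
        m, cov = mean_and_cov([rows[i] for i in kept])
        order = sorted(range(len(rows)),
                       key=lambda i: squared_distance(rows[i], m, cov))
        new_kept = sorted(order[:n_keep])
        if new_kept == kept:                # retained set is stable
            break
        kept = new_kept
    m, cov = mean_and_cov([rows[i] for i in kept])
    rejected = [i for i in range(len(rows)) if i not in kept]
    return m, cov, rejected
```

The list of rejected indices is returned along with the estimates because, as noted in Section 4, the trimmed observations may be the most informative data.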

For this reason, and because MVT is invariant under all nonsingular linear
transformations of the observations, the author has used and recommends the use of MVT. Analysis
of rejected observations is also recommended, and is part of the procedure given in Section
5.1.


5.  A Working Procedure for SPC, and A Wafer Example
5.1  A  Working  Procedure 

Below  is  summarized  a  procedure  for  SPC  with  strong  engineering  support  (a  more 
detailed  and  thorough  form  of  this  procedure  has  been  developed,  but  is  not  given  in  this 
paper).  It  is  assumed  that  the  engineers  have  good  analytic  and  interpretive  skills,  and 
that the product design is stable enough so that a body of process knowledge can be
accumulated and refined.

[Note that a subtlety of the multivariate situation is that either or both bias and
variability can be declared out of control when neither bias nor variability has statistically
significantly increased - but the covariance structure has sufficiently changed.]


ROBUST, MULTIVARIATE SPC PROCEDURE

1)  [GET $S^*$] Compute a robust sample covariance matrix, $S^*$, using MVT on an $N$ by $p$
matrix, $X_{start}$, of selected parts - resulting in trimmed matrix, $X^*_{start}$ ($N$ is assumed large),
and associated major principal axis space, $V^*_{major}$.

2)  [SAMPLE FOR ROUTINE SPC] Sample subgroup $i$ of $n$ parts, resulting in $n$ by $p$
matrix, $X_i$.

3)  [GET $T^2$'s] Compute Hotelling's $T^2_{D,i}$ and $T^2_{M,i}$ in $V^*_{major}$. Plot the Hotelling's $T^2$
values on separate, parallel Hotelling's $T^2$ control charts, and apply the appropriate
control chart rule(s).

4)  [INTERPRET VARIABILITY PROBLEMS] If $T^2_{D,i}$ is out of control, examine $X_i$ in $V^*_{major}$,
look for outliers which might inflate the subgroup variance, and seek to correct. If not
present, compute robust principal components of $X_i$. Examine the major Principal Axis
Plots for systematic patterns of variability, and compare them to those produced for $X^*_{start}$.
Diagnose, record, and go back to 2.

5)  [INTERPRET BIAS PROBLEMS] If $T^2_{M,i}$ is out of control, produce the Spider Plot for
$X_i$. Compare to the Spider Plot of $X^*_{start}$. Look for outliers and systematic bias, and seek
to correct. Optionally, examine $X_i$ in $V^*_{major}$, look for outliers, and seek to correct.
Diagnose, record, and go back to 2.
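
The branching in steps 3 through 5 can be summarized as a small decision function. This is a schematic sketch only: the control limits are passed in as plain numbers (in practice they would come from the chi-square reference distributions given in Section 3), and the function name and return labels are hypothetical.

```python
def spc_actions(t2_d, t2_m, limit_d, limit_m):
    """Which interpretation steps apply to one subgroup.

    t2_d, t2_m:       variability and bias statistics for the subgroup.
    limit_d, limit_m: assumed upper control limits for each chart.
    Returns a list drawn from "variability", "bias", "in control".
    """
    actions = []
    if t2_d > limit_d:
        # Step 4: look for outliers, then examine Principal Axis Plots.
        actions.append("variability")
    if t2_m > limit_m:
        # Step 5: produce the Spider Plot and look for systematic bias.
        actions.append("bias")
    if not actions:
        # Neither chart signals: return to step 2 and sample again.
        actions.append("in control")
    return actions
```

Both branches can fire for the same subgroup, in which case both the Principal Axis Plots and the Spider Plot would be examined before returning to step 2.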


5.2  A  Wafer  Example 


Suppose that a projection photo-lithographic process applies two rectangular grids for
oxide deposition on semiconductor wafers: (1) apply grid A, (2) change projection source
and wafer fixture (reflecting process differences in applying the two different grids), (3)
apply grid B. As described above, the nominal design calls for grid B to fall exactly upon
grid A, but there is always some misalignment.

To carry out SPC on wafers, both vertical and horizontal misalignment of grids are
measured at 25 positions on a 5 by 5 array, as shown in Figure 1. The measurements are
highly correlated, because any one grid of vertical and horizontal lines is applied in a
single process step. Differences from the nominal location within a process step are not
trivial, however (correlations are not +/- 1), because they can be due to a variety of
deviations from the ideal process setup - which do not necessarily affect misalignment
uniformly over the gridded wafer.

Hotelling's $T^2_{M,i}$ and $T^2_{D,i}$ can be used on production wafer subgroups. Classical and
robust sample covariance matrices can be estimated based on many wafers sampled
uniformly over a considerable period of production time. After resolving any concerns about
outliers in the sample used to compute $S$, subgroups of wafers can then be sampled for
routine SPC. Figure 6 shows the Spider Plot for a sample from a subgroup which was "out
of control," according to its Hotelling's $T^2_{D,i}$ statistic, indicating a variability problem.
Figures 7a and 7b show the first two major Principal Axis Plots. We can see that our chief
concern should be variable diagonal translation - perhaps due to problems with wafer or
projection lens fixtures. Secondarily, we may choose to try to reduce an
expansion/compression problem - perhaps due to projection lens distortions or wafer
distortions.

Another subgroup would have been declared "out-of-control" due to high values of
both of its Hotelling's $T^2$ statistics, had not robust principal components been used. It was
contaminated by an outlying observation. Robust principal components rejected
observation #7, as shown in Figures 9 and 10. When observation #7 was replaced by one taken
from another wafer, the subgroup was no longer out of control, according to $T^2$ statistics.

The  example  could  be  continued  to  illustrate  the  complexity  and  subtlety  of  possible 
multivariate  phenomena  and  to  provide  illustration  of  the  various  ways  that  the  proposed 
control machinery could be used. Instead, the reader can study the different paths of the
procedure provided in Section 5.1.


6.  Summary 

Hotelling's $T^2$, robust principal components, and appropriate graphical displays can be
combined to form a sophisticated system for statistical process control when many
correlated performance variables are measured in the same physical unit. We need not be so
concerned about the computational complexity or burden of such an SPC system. Rather,
the harder task is to develop systems which are not only statistically sound, but can lead
to meaningful, interpretable results which guide corrective action. In brief: (1) Hotelling's
$T^2_{M,i}$ and $T^2_{D,i}$ statistics are the multivariate analog of $\bar{X}$ and $R$ charts, and should be used
rather than a multitude of simultaneous $\bar{X}$ and $R$ charts or the standard $T^2$; (2) Use of
robust principal components analysis helps us to avoid misinformation due to atypical
data - such as outliers and departures from the standard process correlation structure. It
also makes possible more subtle interpretation of multivariate data; (3) High-resolution
computer graphics are widely available, and graphical tools for SPC, such as Spider Plots
and Principal Axis Plots, can be built into an interactive SPC system. Lastly, the only
way such a complex SPC system can work is to have well-trained process engineers who
are dedicated to improving the process, and who have good analytic and interpretive skills.

The application of this control methodology to wafer grid misalignment is only one
example of its potential use. With slight modification it can be applied to the case of any
measurement taken at different locations on a wafer - sheet resistivity, thickness,
inductance, etc. More important, it also applies to other manufactured products which have
many measurements taken in the same physical units. For example, it could be applied to
registration patterns on PC boards, emulsion thickness or purity on photographic film,
thickness or density of "uniform" sheets of steel or some other material, diameters of
spheres or cylinders, such as ball bearings or rods, or physical dimensions of
arbitrarily-shaped parts. The Exaggerated-Measurement Plot, Spider Plot, and Principal Axis Plot can
be generalized for these other areas of application. The Exaggerated-Measurement Plot and
Principal Axis Plot can be replaced by contour or 3D plots. The Spider Plot is harder to
generalize, but a glyph plot can be used.

Areas for further research include distributional theory for small sample sizes, analogs
to the $L_2$-norm-based covariance matrix and principal components - perhaps based on
other norms, and ways to catalog process-related patterns.

An extreme case of grid misalignment, due to manufacturing variations, on a (perhaps
monitor) semiconductor wafer. A 5-by-5 array of misalignment measurements is taken as
a relative displacement of grid A to grid B. The locations of the measurements can either
be absolute or determined, de facto, by one of the grids.


COLLINEARITY AND POINTS OF EXPANSION IN POLYNOMIAL REGRESSION
(Interim Report)


Michael  F.  Driscoll,  Arizona  State  University 


ABSTRACT 

Frequent use is made of single- and
multi-variable polynomial regression models in
situations for which prior knowledge fails to
suggest a specific response function. In such
models, the numerical and statistical stability
of the least-squares estimators are highly
dependent on the point of expansion, or origin,
used for the underlying predictor variables.
Results of some mathematical analyses of this
phenomenon are given. Related issues in variable
selection and diagnostic checking are also
considered.


1.  INTRODUCTION 

The purpose of this paper is to give an
overview of issues which arise when doing
ordinary least-squares fitting of functional
relationships described only as polynomials in
one or more variables. This method, herein
called the polynomial approach, is commonly used
in pilot studies and is necessary whenever prior
knowledge is not detailed enough to postulate a
more specific relational form for examination.

In Section 2, the polynomial approach to a
regression problem is described very generally as
a problem of estimating the coefficients in a
power series approximation to a smooth function.
Assumptions and difficulties inherent in this use
of power series are discussed in Section 3. The
question of selecting an optimal point about
which to express the series expansion is
considered in Sections 4 and 5. Some final
remarks are made in Section 6.

This  paper  is  designated  an  interim  report 
since  the  research  summarized  in  it  is  still 
incomplete  (especially  in  Section  5).  I  have 
proofs for some results; sketches of the more
detailed of these are given in the Appendix.
Statements  which  I  believe  to  be  true  but  have 
not  yet  proven  are  offered  as  conjectures. 


2.  THE  POLYNOMIAL  APPROACH 

The aim of multiple least-squares linear
regression analysis is to determine a suitable
model

(2.1)  $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$

for describing the relationship between a
response variable $Y$ and several predictor
variables

(2.2)  $X_i = f_i(U_1, \ldots, U_b)$

defined from certain underlying or basic
variables $U_1, \ldots, U_b$ available to the analyst. It
is often not clear at the outset what form the
predictors (2.2) should have, so a common method
is to use predictors of the form

(2.3)  $X(p|c) = (U_1 - c_1)^{p_1} (U_2 - c_2)^{p_2} \cdots (U_b - c_b)^{p_b}$,

whose definition depends on the vectors $p = (p_1, \ldots, p_b)'$
and $c = (c_1, \ldots, c_b)'$. The effect of
the polynomial approach is that the unknown true
response function,

(2.4)  $E(Y) = g(U_1, \ldots, U_b)$

(say), is replaced by an unknown approximate
response function

(2.5)  $E(Y) \approx \sum \beta(p|c)\, X(p|c)$

which is a partial power series expansion of
(2.4) about the point $c$ in the space of the basic
variables. The need is then to obtain a fitted
response function

(2.6)  $\hat{Y} = \sum \hat{\beta}(p|c)\, X(p|c)$,

that is, to obtain the least-squares estimates
$\hat{\beta}(p|c)$ of the parameters $\beta(p|c)$. The highest
value of

$p_1 + \cdots + p_b$

among the predictors used is called the order of
the model, and is herein denoted by $P$.

3.  CONSEQUENCES  OF  SERIES  APPROXIMATION 

The polynomial approach to a regression
problem entails several conceptual and practical
concerns which can be illuminated by emphasizing
the effect of the series approximation of (2.4)
by (2.5). One is the tacit assumption that the
true response function has only very mild
discontinuities, if any, in the pertinent part of
the domain of the basic variables. This
assumption is in fact unavoidable: if the
analyst's knowledge is insufficient to suggest an
approach more specific than one based on power
series approximation, then he is unlikely to have
such information on continuity properties.

Choice  of  the  order  of  the  series 
approxiaation  (2.5)  is  a  aore  Immediate  concern. 
If  (2.4)  is  assumed  to  be  continuous  in  its 
several  arguments,  then  there  is  no  question  but 
that  a  large  enough  value  of  P  for  (2.5)  will 
produce  an  approximate  response  function  which  is 
practically  indistinguishable  from  the  true 
response  function.  In  single-variable  problems 
(those  with  just  one  basic  variable  U)  one  has 
considerable  flexibility  in  selecting  P.  But  in 
multi-variable  problems,  one’s  choice  of  P  is 
liaited  by  computational  requirements  and,  aore 
importantly,  by  the  difficulty  of  interpreting 
interaction  terms.  The  larger  the  mmber  of 
basic  variables,  the  less  workable  are  the  higher 
order  approxiaations.  This  aspect  of  the 
polynomial  approach  is  well  understood,  so  aodels 
of  low  order  are  often  used. 

The most important indeterminate in (2.5) is
the point of expansion or centering, $(c_1, \ldots, c_b)$.
It plays a pivotal role in the expression of the
series approximation, and therefore can have an
overwhelming impact on the fitted response
function. However, it is absent in the true
response function (2.4).

There is some controversy over the idea of
changing scale or location in regression
predictors, due to the possible effect on
diagnostics for predictor ill-conditioning. The
paper  by  Belsley  (1984)  and  the  attendant 
comments  by  others  discuss  some  fundamental 
advantages  and  disadvantages  of  mean-centering 
the  predictors.  The  centering  being  considered 
in  this  paper  is  the  selection  of  a  point  of 
expansion  in  the  space  of  the  basic  variables, 
around  which  the  predictors  are  to  be 
constructed;  centering  of  the  predictors 
themselves  is  not  being  treated. 

The true response function does not depend on
the point of expansion, so the description of the
response function implied by its approximate form
(2.5) and obtained from the regression fit (2.6)
must be invariant to the point of expansion used.
In other words, the fitted response functions
obtained from various points of expansion must be
equivalent descriptions of the true response
function. This requirement is met only if (2.6)
is complete in the sense that it contains all
predictors through a given order, P, that is,
includes all terms for which the elements of p
satisfy the condition

$0 \le p_1 + \cdots + p_b \le P$,

or in the weaker sense that it contains all
terms which can be produced by striking one or
more factors from any predictor present.
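
The two senses of completeness can be checked mechanically. The following is a minimal Python sketch, not taken from the paper: the function names and the small example models are hypothetical, and "striking a factor" is interpreted as lowering one positive exponent by one.

```python
from itertools import product

def complete_exponents(b, P):
    """All exponent vectors p = (p1, ..., pb) with 0 <= p1 + ... + pb <= P."""
    return [p for p in product(range(P + 1), repeat=b) if sum(p) <= P]

def is_weakly_complete(exponents):
    """True if striking any single factor from any predictor stays in the set.

    Closure under single strikes implies closure under repeated strikes.
    """
    present = set(exponents)
    for p in present:
        for j, pj in enumerate(p):
            if pj > 0:
                lower = p[:j] + (pj - 1,) + p[j + 1:]
                if lower not in present:
                    return False
    return True

full = complete_exponents(b=2, P=2)      # full second-order model in two variables
flawed = [(0, 0), (1, 0), (1, 1)]        # contains U1*U2 but lacks U2 alone
print(len(full), is_weakly_complete(full), is_weakly_complete(flawed))
```

A predictor-selection routine could call such a check before accepting a candidate subset, which is one way of respecting the hierarchy among predictors.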

Completeness  in  polynomial  models  is  discussed 
in  detail  in  Driscoll  and  Anderson  (1980).  The 
need  for  it  is  not  widely  enough  appreciated  by 
practitioners.  Predictor  selection  done  under 
the  polynomial  approach  is  often  flawed, 
resulting  in  misapplications  which  are  analogous 
to  (if  less  transparent  than)  using  a  zero 
intercept  model  without  intending  to  do  so. 
Griepentrog, Ryan, and Smith (1982) allude to
this issue in discussing the effect which changes
of location and scale in the basic variables can
have on the t-tests for lower-order coefficients
in the polynomial model. The example they give
is  meant  to  illustrate  that  such  t-tests  may  be 
meaningless.  It  is  more  properly  viewed  as  an 
argument  in  favor  of  a  logically  prior 
requirement:  that  of  respecting  any  hierarchy 
present  among  the  predictors  used.  This  is  more 
easily  accomplished  today  than  in  the  past, 
although  it  still  requires  some  user  effort, 
especially  if  an  "all"  subsets  analysis  is 
desired.  At  least  program  BMDP2R  (Dixon,  1983, 
Appendix  F.4)  now  has  facilities  for  entering  and 
removing  predictors  in  a  specified  sequence  or  by 
defined  groups;  and,  of  course,  there  is  program 
BMDP5R  for  single-variable  models.  Procedure 
STEPWISE  (SAS  Institute,  Inc.,  1985,  p.  764)  has 
no  such  facilities,  but  the  manual  does  refer  the 
user  to  a  survey  article  by  Hocking  (1976). 

On the other hand, the fact that the point of
expansion is indeterminate has its advantages.
In particular, it allows the analyst to choose
the point of expansion so as to satisfy other
needs. One auxiliary goal is to mitigate the
numerical and statistical instabilities arising
from the collinearity pandemic in polynomial
regression models. Another is to enhance the
interpretability of the resulting fitted model
(2.6). Snee and Marquardt (1984), in their
comment on Belsley's paper, give an analysis of
some tree-volume data which nicely illustrates
this benefit of point of expansion indeterminacy.

Discussion  now  turns  to  criteria  for  selection 
of  the  point  of  expansion.  The  obvious  starting 
point  is  with  one  basic  variable,  that  is,  with 
single-variable  polynomial  regression. 


4.  ONE  BASIC  VARIABLE 

In the single-variable polynomial regression
model, the predictors (2.3) are

(4.1)  $X(p|c) = (U - c)^p$,  $p = 0, 1, \ldots, P$,

where P is the order of the approximate and
fitted response functions. The investigation
into the choice of the point of expansion will be
done in sample terms to avoid the need for
distributional assumptions. A sample $u_1, u_2,
\ldots, u_n$ of observed values of the basic variable
provides samples

(4.2)  $x_i(p|c) = (u_i - c)^p$,  $i = 1, 2, \ldots, n$,

on each of the predictors. To avoid annoying
qualifications in what follows, it is assumed
throughout that the $u_i$'s contain at least three
distinct values. [The case of just two distinct
values is trivial. Also, the purposes of this
paper do not require reference to observed
response values.]

4.1.  Scalar  Moments 

The usual sample moments of the predictors
have strong properties. The means

(4.3)  $m(p|c) = \sum_i (u_i - c)^p / n$

have first and second derivatives $-p\,m(p-1|c)$ and
$p(p-1)\,m(p-2|c)$ with respect to c. [The symbol
$m(\cdot|c)$, and others like it which appear below, is
to be interpreted as zero if the argument is
negative.]

It follows immediately that: (1) if p is odd,
the mean is monotone decreasing in c with a
unique zero; (2) if p is even, the mean is
convex, so it uniquely attains a minimum; (3)
these zeros and minima occur between $\min_i u_i$ and
$\max_i u_i$. If the $u_i$'s are symmetric about their
mean $\bar{u} = m(1|0)$, then the zeros and minima all
occur at $\bar{u}$, suggesting that this is a good point
of expansion.
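
These properties are easy to check numerically. The sketch below is an illustration only: the sample is hypothetical, the helper name `m` simply mirrors (4.3), and exact rational arithmetic is used so that the zeros are exact rather than approximate.

```python
from fractions import Fraction

def m(p, c, u):
    """Sample mean (4.3) of (u_i - c)**p; taken as zero when p is negative."""
    if p < 0:
        return Fraction(0)
    return sum((Fraction(ui) - c) ** p for ui in u) / len(u)

u = [1, 2, 3, 4, 5]            # hypothetical sample, symmetric about its mean
ubar = m(1, 0, u)              # the mean m(1|0)

# Odd-order means evaluated at the sample mean.
odd_at_mean = [m(p, ubar, u) for p in (1, 3, 5)]

# Even-order means minimized over a grid c = 1.00, 1.25, ..., 5.00.
grid = [Fraction(k, 4) for k in range(4, 21)]
argmin_even = [min(grid, key=lambda c: m(p, c, u)) for p in (2, 4)]

print(ubar, odd_at_mean, argmin_even)
```

For this symmetric sample the odd-order means vanish at the sample mean and the even-order means attain their grid minimum there, as properties (1) to (3) predict.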

The predictor covariances

(4.4)  $s(p,q|c) = m(p+q|c) - m(p|c)\, m(q|c)$

have first derivatives

(4.5)  $-p\, s(p-1,q|c) - q\, s(p,q-1|c)$

and second derivatives

(4.6)  $p(p-1)\, s(p-2,q|c) + 2pq\, s(p-1,q-1|c) + q(q-1)\, s(p,q-2|c)$

with respect to c. Since (4.4) is zero if pq = 0
and positive when pq > 0 and p+q is even (see the
Appendix for a proof), these covariances have
monotonicity and convexity properties analogous
to those of the predictor means. In particular,
if the $u_i$'s are symmetrically placed then, at
$c = \bar{u}$, $s(p,q|c)$ is zero for p+q odd and minimized for
p+q even.
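
As a sanity check on (4.4) and (4.5), the derivative formula can be compared with a central finite difference. This is a sketch under stated assumptions: the samples, the step size, and the function names are all hypothetical choices made for illustration.

```python
from fractions import Fraction

def m(p, c, u):
    """Sample mean (4.3) of (u_i - c)**p; zero when p is negative."""
    if p < 0:
        return Fraction(0)
    return sum((Fraction(ui) - c) ** p for ui in u) / len(u)

def s(p, q, c, u):
    """Covariance (4.4); zero when either index is negative."""
    if p < 0 or q < 0:
        return Fraction(0)
    return m(p + q, c, u) - m(p, c, u) * m(q, c, u)

def ds(p, q, c, u):
    """First derivative (4.5) of s(p,q|c) with respect to c."""
    return -p * s(p - 1, q, c, u) - q * s(p, q - 1, c, u)

u = [0, 1, 3, 7]                          # hypothetical non-symmetric sample
c, h = Fraction(1, 2), Fraction(1, 10**6)
central_diff = (s(2, 3, c + h, u) - s(2, 3, c - h, u)) / (2 * h)
gap = abs(central_diff - ds(2, 3, c, u))  # truncation error of order h**2

sym = [1, 2, 3, 4, 5]                     # symmetric about 3
print(float(gap), s(1, 2, 3, sym), s(1, 3, 3, sym))
```

The second part of the example evaluates two covariances at the center of a symmetric sample: the one with p+q odd is zero, and the one with p+q even is positive.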

The behavior of the predictor correlations is
more intriguing. In the interests of
tractability, only the symmetric case is
considered here, and in terms of the coefficients
of determination

(4.7)  $r^2(p,q|c) = s^2(p,q|c) / [s(p,p|c)\, s(q,q|c)]$.
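
The coefficient of determination (4.7) can be evaluated directly from the sample moments. The following sketch uses a hypothetical symmetric sample and hypothetical helper names; it only illustrates the behavior at the center of symmetry for one odd and one even value of p+q, and proves nothing in general.

```python
from fractions import Fraction

def m(p, c, u):
    """Sample mean (4.3) of (u_i - c)**p."""
    return sum((Fraction(ui) - c) ** p for ui in u) / len(u)

def s(p, q, c, u):
    """Covariance (4.4) between powers p and q."""
    return m(p + q, c, u) - m(p, c, u) * m(q, c, u)

def r2(p, q, c, u):
    """Coefficient of determination (4.7) between powers p and q."""
    return s(p, q, c, u) ** 2 / (s(p, p, c, u) * s(q, q, c, u))

u = [1, 2, 3, 4, 5]              # hypothetical sample, symmetric about 3
ubar = Fraction(3)
step = Fraction(1, 10)

at_mean = r2(1, 3, ubar, u)      # p+q even
nearby = [r2(1, 3, ubar - step, u), r2(1, 3, ubar + step, u)]
print(r2(1, 2, ubar, u), float(at_mean), [float(x) for x in nearby])
```

For this sample, r2(1,2|c) is zero at the center, while r2(1,3|c) is larger at the center than at the two neighboring points.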


For pq = 0 or p+q odd, (4.7) clearly has an
absolute minimum at $c = \bar{u}$. In the case that p+q
is even (ignoring the case that p = q), $r^2(p,q|c)$
appears to have a relative maximum at $c = \bar{u}$. I
can prove this if p and q are odd (see the
Appendix), and I conjecture it to be true when p
and q are even.

Specific illustrations of the behavior of
these coefficients of determination are available
in Bradley and Srivastava (1979) and Hackney and
Mohammad (1978). These similar papers include
formulas expressing (4.7) as a rational function
with coefficients written in terms of the means
$m(\cdot|\bar{u})$, and give graphs of $r^2(1,2|c)$ for two
symmetric situations. The graphs were produced
by assigning to the $m(\cdot|c)$'s the values of the
corresponding population moments first from the
standard normal and second from the rectangular
distribution. Hackney and Mohammad also plot
$r^2(1,3|c)$ for these situations and note that $c =
\bar{u}$ is a point of relative maximum. Further, they
note in these situations the patterns

(4.8)  $r^2(p-2,q|\bar{u}) \le r^2(p,q|\bar{u}) \le r^2(p+1,q+1|\bar{u})$

(for p+q even) among the correlations of the
first 12 powers of the centered basic variable,
and suggest that it may therefore suffice in
non-symmetric situations to choose c so as to
achieve a minimal value for the highest-order
(p+q even) determination among the predictors
being used. Since I think the nature of (4.7)
needs to be further investigated before this
suggestion can be evaluated, I have not yet tried
to prove (4.8).

4.2. Matrix Moments

Although the mathematical analysis of the
scalar moments of the predictors (2.3) is
incomplete, the results obtained thus far do
indicate that a more encompassing approach would
be valuable. What is required is to study the
predictors as a total ensemble rather than one or
two at a time, that is, to consider matrix
moments. Such analysis should provide guidance
for selecting the point of expansion in a way
which better controls the covariance structure of
the least-squares estimators.

The purpose of this subsection is to present a
particular anatomy of the predictors which seems
to promise results of this kind. The anatomy has
a matrix formulation.

The design matrix of the order P single-
variable polynomial regression model is, from
(4.1) and (4.2), the $n \times (1+P)$ matrix

(4.9)  $X(c) = [(u_i - c)^p]$,

where $i = 1, \ldots, n$ and $p = 0, 1, \ldots, P$. This matrix has
first derivative (see, for example, Rogers, 1980)

$(d/dc)\, X(c) = [-p\,(u_i - c)^{p-1}]$

with respect to c. Let J be the $(1+P) \times (1+P)$
matrix with -p as element (p-1,p) for $p = 1, \ldots, P$
and other elements zero, that is, the super-
diagonal matrix $J = \mathrm{supdiag}\{-1, \ldots, -P\}$. Then

(4.10)  $(d/dc)\, X(c) = X(c)\,J$.


Using results from elementary matrix differential
equations (see, for example, Finkbeiner, 1966,
Chapter 10) on (4.10), or by direct matrix
calculation, one can show that X(c) satisfies

(4.11)  $X(c) = X(0)\exp(Jc)$,

where exp(Jc) denotes the matrix exponential
function

(4.12)  $\exp(Jc) = \sum_p J^p c^p / p! = I + Jc + J^2 c^2/2! + \cdots + J^P c^P/P!$,

the series being finite here because J is
nilpotent of order 1+P.
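
The relation (4.11) and the finiteness of the series (4.12) can be verified by direct matrix calculation. The sketch below is illustrative only: it uses plain Python lists with exact rational arithmetic rather than any particular matrix library, and the sample, the order P, and the centering value are hypothetical.

```python
from fractions import Fraction
from math import factorial

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def superdiag_J(P):
    """(1+P)x(1+P) matrix with -p in position (p-1, p) and zeros elsewhere."""
    J = [[Fraction(0)] * (P + 1) for _ in range(P + 1)]
    for p in range(1, P + 1):
        J[p - 1][p] = Fraction(-p)
    return J

def exp_Jc(P, c):
    """Finite series (4.12); also returns J**(1+P) to expose nilpotency."""
    n = P + 1
    J = superdiag_J(P)
    power = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]  # J**0
    total = [[Fraction(0)] * n for _ in range(n)]
    for k in range(n):
        coef = Fraction(c) ** k / factorial(k)
        total = [[t + coef * x for t, x in zip(tr, pr)]
                 for tr, pr in zip(total, power)]
        power = matmul(power, J)
    return total, power

def design(u, c, P):
    """Design matrix (4.9): rows indexed by observation, columns by power."""
    return [[(Fraction(ui) - c) ** p for p in range(P + 1)] for ui in u]

u, P, c = [0, 1, 3, 7], 3, Fraction(5, 2)   # hypothetical sample and centering
E, J_high = exp_Jc(P, c)
same = matmul(design(u, 0, P), E) == design(u, c, P)
nilpotent = all(x == 0 for row in J_high for x in row)
print(same, nilpotent)
```

Because the arithmetic is exact, the comparison tests equality of X(c) and X(0)exp(Jc) entry by entry rather than agreement to within rounding error.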

The analogy between (4.12) and the scalar
exponential function is obvious, as is that of
(4.10) and (4.11). In further analogy, the
inverse of (4.12) exists and is given by

(4.13)  $\{\exp(Jc)\}^{-1} = \exp(-Jc)$.


Using (4.12) and (4.13) it is readily shown via
mathematical induction that the elements of these
matrices are, for $p, q = 0, 1, \ldots, P$,

(4.14)  $(\exp(Jc))(p,q) = {}_qC_p\, (-c)^{q-p}$

and

(4.15)  $(\exp(-Jc))(p,q) = {}_qC_p\, c^{q-p}$,

where ${}_qC_p$ denotes the binomial coefficient q-
choose-p. In particular, these matrices are
upper-triangular.
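
The closed forms (4.14) and (4.15) can be checked against the inverse relation (4.13). This is a minimal sketch with hypothetical names and values; the binomial coefficient is taken as zero below the diagonal, which is what makes the matrices upper-triangular.

```python
from fractions import Fraction
from math import comb

def exp_Jc_closed(P, c):
    """Closed form (4.14): element (p, q) is C(q, p) * (-c)**(q - p) for q >= p."""
    return [[comb(q, p) * Fraction(-c) ** (q - p) if q >= p else Fraction(0)
             for q in range(P + 1)] for p in range(P + 1)]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

P, c = 4, Fraction(3, 2)                 # hypothetical order and centering
E = exp_Jc_closed(P, c)                  # exp(Jc), from (4.14)
E_inv = exp_Jc_closed(P, -c)             # exp(-Jc), which is (4.15)

identity = [[Fraction(int(i == j)) for j in range(P + 1)] for i in range(P + 1)]
is_inverse = matmul(E, E_inv) == identity
is_upper = all(E[p][q] == 0 for p in range(P + 1) for q in range(p))
print(is_inverse, is_upper)
```

Passing -c to the same function yields (4.15), so a single routine covers both matrices.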

One can also express the relation between the
parameter estimates in (2.6) by the algebraic
equality

(4.16)  $\hat{\beta}(c) = \exp(-Jc)\, \hat{\beta}(0)$.

[Reference here is to the least-squares
estimators; computational instabilities in
producing numerical estimates are for the moment
being ignored.] It is apparent from (4.15) and
(4.16) that $\hat{\beta}(P|c) = \hat{\beta}(P|0)$. This is a
reflection of the well known fact that changing
the location of the basic variables in a
polynomial model does not alter the estimates of
the highest-order parameters.
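
The algebra behind (4.16) can be illustrated without performing a regression. In the sketch below the coefficient vector is an arbitrary hypothetical polynomial rather than an actual least-squares fit; the point is only that applying the matrix (4.15) re-expresses the same function about a new point of expansion and leaves the highest-order coefficient unchanged.

```python
from fractions import Fraction
from math import comb

def recenter(beta0, c):
    """Coefficients about c from coefficients about 0, via the elements (4.15)."""
    P = len(beta0) - 1
    return [sum(comb(q, p) * Fraction(c) ** (q - p) * beta0[q]
                for q in range(p, P + 1))
            for p in range(P + 1)]

def evaluate(beta, c, u):
    """Value of sum_p beta[p] * (u - c)**p."""
    return sum(b * (Fraction(u) - c) ** p for p, b in enumerate(beta))

beta0 = [Fraction(2), Fraction(-1), Fraction(1, 2), Fraction(3)]  # hypothetical cubic
c = Fraction(5, 2)
beta_c = recenter(beta0, c)

same_function = all(evaluate(beta_c, c, u) == evaluate(beta0, 0, u)
                    for u in range(-3, 4))
print(same_function, beta_c[-1] == beta0[-1])
```

The lower-order coefficients do change with c, which is why tests on them depend on the point of expansion.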

Equations (4.11) and (4.16) succinctly state
the effect of taking c rather than zero as the
point of expansion for the polynomial approach.
Using (4.11) first with c itself and then with
$c = \bar{u}$ gives

$X(c) = X(\bar{u})\exp\{J(c - \bar{u})\}$,

which after pre-multiplication by a vector of
(1/n)'s yields a matrix form of the equations
Bradley and Srivastava (1979) gave for computing
the means (4.3). If one changes scale by a
factor d after changing location by -c, then the
right-hand side of (4.11) should be post-
multiplied by the diagonal matrix
$D = \mathrm{diag}\{1, d, d^2, \ldots, d^P\}$. If scale is changed first
then a pre-multiplication is required, as in

$X(d,c) = X(1,0)\, D \exp(Jc)$,

which is the general form of the triangular
linear transformation which Griepentrog, Ryan,
and Smith (1982) considered for the case P=2.

The normal-equations coefficient matrix and
its inverse are of prime interest at this point.
The anatomy summarized by (4.11) gives them the
forms

$(X'X)(c) = \{X(c)\}'\{X(c)\} = \exp(J'c)\,(X'X)(0)\,\exp(Jc)$

and

$(X'X)^{-1}(c) = \exp(-Jc)\,(X'X)^{-1}(0)\,\exp(-J'c)$.

The fact that these matrix functions have such
similar forms is promising for further
mathematical analysis. Investigations are
incomplete as yet, but a few initial indications
can be given.

An easy result of negative nature is the
following one. Since J is nilpotent, its
eigenvalues are all zero, those of exp(Jc) are
all exp(0c) = 1, so $|\exp(Jc)| = 1$ and $|(X'X)(c)| =
|(X'X)(0)|$ for all c. This shows that D-
optimality is a worthless criterion in the
single-variable polynomial approach, which is in
marked contrast to other regression situations
(see, for example, Bates, 1983).
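
The invariance of the determinant is simple to confirm on a small example. The sketch below is an illustration under assumptions: the sample, the order, and the trial points of expansion are hypothetical, and the determinant is computed by exact Gaussian elimination over the rationals.

```python
from fractions import Fraction

def gram(u, c, P):
    """Normal-equations matrix X'X for the design matrix (4.9) centered at c."""
    X = [[(Fraction(ui) - c) ** p for p in range(P + 1)] for ui in u]
    return [[sum(row[p] * row[q] for row in X) for q in range(P + 1)]
            for p in range(P + 1)]

def det(A):
    """Determinant by exact Gaussian elimination with row exchanges."""
    A = [row[:] for row in A]
    n, d = len(A), Fraction(1)
    for k in range(n):
        pivot = next((r for r in range(k, n) if A[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != k:
            A[k], A[pivot] = A[pivot], A[k]
            d = -d                      # a row exchange flips the sign
        d *= A[k][k]
        for r in range(k + 1, n):
            factor = A[r][k] / A[k][k]
            A[r] = [x - factor * y for x, y in zip(A[r], A[k])]
    return d

u, P = [0, 1, 3, 7], 2                  # hypothetical sample and order
dets = [det(gram(u, c, P)) for c in (0, Fraction(11, 4), 10)]
print(dets)
```

All three determinants coincide, so a criterion based on the determinant of X'X cannot distinguish among points of expansion.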

It appears that, as is the case for (X'X)(c),
the diagonal elements of $(X'X)^{-1}(c)$ are convex in
c; the first P of them strictly so, the last
being constant. In the symmetric case, the
minima would occur at $c = \bar{u}$. These results would
be rather useful, since the diagonal elements of
$(X'X)^{-1}(c)$ appear in the variances of the least-
squares estimators, and although I have not yet
achieved proof of them I do conjecture them to be
true. Their proof and the discovery of other
facts about this inverse matrix will likely use
some properties of Vandermonde, Hankel, and
Toeplitz matrices.


6.  FINAL  REMARKS 

Bradley and Srivastava (1979) recommend that
non-essential  collinearity  in  a  regression  model 
be  reduced  by  appropriate  choice  of  a  point  of 
expansion  and  that,  when  possible,  collinearity 
inherent  in  the  basic  variables  be  removed  by 
good  experimental  design.  That  goal  —  joined 
with  a  belief  that  mathematical  analysis  can  give 
insight  into  the  effect  of  the  point  of  expansion 
in  the  polynomial  approach  —  is  the  motivation 
for  the  present  work. 

The  disagreements  about  the  benefits  and 
methods  of  data  centering,  in  particular 
Belsley's (1984) paper and the comments on it,
may have some relevance to selecting a point of
expansion.  It  is  not  clear  what  conclusions  will 
ultimately  derive  from  such  debates,  not  for 
centering  of  predictors  and  certainly  not  for 
centering  of  basic  variables.  What  is  clear  is 
that  losing  sight  of  the  distinction  between  the 
model  and  any  given  description  of  it  is 
disastrous  to  understanding.  Herr  (1980)  has 
made  this  latter  point  by  comparing  the  geometric 
(or  coordinate-free)  and  algebraic  approaches  to 
linear  models.  Jacobowitz  and  Driscoll  (1980) 
give  a  more  mathematically  abstract  presentation 
which  distinguishes  between  the  properties 
inherent  in  the  model,  those  in  the  model  under  a 
particular  parametrization,  and  those  in  the 
parametrized  model  with  an  explicit  coordinate 
representation.  Such  distinctions  are  especially 
needed  when  discussing  and  using  the  polynomial 
approach  to  linear  regression  analysis. 

APPENDIX

This  appendix  sketches  the  more  detailed  of 
the  proofs  for  results  stated  in  the  body  of  the 
paper.  I  would  appreciate  receiving  information 
about  alternate  or  expanded  proofs. 

A.1. Covariances Positive

When p and q are positive and p+q is even,
$s(p,q|c)$ is bounded below by the covariance
between powers p and q of $|u_i - c|$. Applying the
lemma in Gurland (1967) with $f(\cdot)$ and $g(\cdot)$ as the
corresponding power functions and X as a random
variable taking values $|u_i - c|$ with probability
proportional to the multiplicity of $u_i$ among
$u_1, \ldots, u_n$, one sees that $s(p,q|c)$ is nonnegative.
The value can be zero only if the $|u_i - c|$ are
equal for all i, which does not occur since (by
assumption) $u_1, \ldots, u_n$ contain at least three
distinct values.

A.2. Relative Maxima in $r^2(p,q|c)$

The case to be considered is pq positive, p+q
even, and $u_1, \ldots, u_n$ symmetric about $\bar{u}$. Use the
derivatives (4.5) and (4.6) to show that
$r^2(p,q|c)$ has a critical point at $c = \bar{u}$ and that
its behavior there is the same as that of the
function

$h(c) = s^2(p,q|c) - s(p,p|c)\, s(q,q|c)$.

To show that $r^2(p,q|c)$ has a relative maximum at
$\bar{u}$, it now suffices to prove that

(A.1)  $h''(\bar{u}) = 2\, s(p,q|\bar{u})\, s''(p,q|\bar{u}) - s(p,p|\bar{u})\, s''(q,q|\bar{u}) - s(q,q|\bar{u})\, s''(p,p|\bar{u})$

is negative. Here, $''$ denotes the second
derivative with respect to c, so that $s''(\cdot,\cdot|\bar{u})$
is given by (4.6).

Expand (A.1) using (4.6) to express $h''(\bar{u})$
entirely in terms of undifferentiated
covariances. Then use the positive definiteness
of the covariance matrix $\{s(p,q|\bar{u})\}$ to show that
the three terms having covariances between powers
p-1 and q-1 as factors in their coefficients have
a sum which is negative. This step reduces the
problem to showing that

(A.2)  $p_0\, s(p-2,q|\bar{u})\, s(p,q|\bar{u}) + q_0\, s(p,q-2|\bar{u})\, s(p,q|\bar{u}) - p_0\, s(p-2,p|\bar{u})\, s(q,q|\bar{u}) - q_0\, s(q,q-2|\bar{u})\, s(p,p|\bar{u})$,

in which $p_0$ denotes 2p(p-1) and $q_0$ denotes
2q(q-1), is nonpositive.

Now, reduce to the subcase that p and q are
odd and, without loss of generality, take p less
than q. The resulting simplification in (A.2) is
that all occurrences of "s" are replaced by "m,"
so that one may directly deal with predictor
means. At this point, use

(A.3)  $m(p+q|c)\, m(p+q-2|c) \le m(2p|c)\, m(2q-2|c) \le m(2p-2|c)\, m(2q|c)$

to show that (A.2) is nonpositive, completing the
proof.

The moment inequalities in (A.3) can be
established by judicious use of a result of
Sclove, Simons, and van Ryzin (1967) which is
listed in Patel, Kapadia, and Owen (1976, p.47).
The facts that p and q are positive and that p is
less than q are important.

In the subcase that p and q are even, one can
of course use (4.4) to express (A.2) explicitly
in terms of predictor means. But I have not been
able to prove in this subcase that (A.2) is
nonpositive. While I believe it is, I also
believe that the proof will be delicate.


A  WORKSTATION-BASED  ENVIRONMENT  FOR  STATISTICAL  ANALYSIS  OF  SET-VALUED  DATA 


Lionel Galway, Carnegie-Mellon University


Abstract 


Although  set-valued  data  exists  in  many  fields, 
statistical  techniques  for  analyzing  such  data  are  not 
well-developed.  One  problem  with  set-valued  data 
is  that  almost  any  non-trivial  analysis  makes  heavy 
computational  demands  and  requires  computer 
graphics;  such  usage  of  central  mainframe  computers 
can  be  quite  expensive.  We  report  on  the  current 
state  of  development  of  an  integrated  software 
package  to  analyze  two-dimensional  set-valued  data: 
it  includes  routines  for  manipulating  and  doing 
calculations  on  sets,  facilities  for  generating  pseudo¬ 
random  sets,  and  provisions  for  graphical  output  and 
user  interaction.  Although  much  of  the  package  has 
been  designed  to  be  system-independent,  it  takes 
full  advantage  of  the  unique  facilities  of  the 
ANDREW  window  manager  and  the  VICE  distributed 
file  system,  plus  the  availability  at  CMU  of  powerful 
workstations  with  graphical  input  devices  and  high- 
resolution  bit-mapped  graphics  displays  (ANDREW 
and  VICE  were  developed  by  the  Information 
Technology  Center  at  CMU  to  support  a  campus¬ 
wide  network  of  personal  workstations). 


1.  Introduction 

Statistics  is  concerned  with  random  quantities; 
traditionally  these  have  been  random  numbers, 
vectors  or  functions.  A  natural  extension  is  to 
study  random  sets  in  n-dimensional  Euclidean  space. 
This  is  important  from  a  theoretical  point  of  view 
and  from  a  practical  one  as  well:  data  from  a 
variety  of  fields  is  naturally  expressed  in  terms  of 
sets. Examples can be found in geology (using sand
grains' shape and size distribution to determine their
provenance (Ehrlich et al., 1980)), stereology
(determining a three-dimensional structure from two-
dimensional slices (Jensen et al., 1985)), and from
fields such as computed tomography, granulometry,
etc. (Trader, 1981). A theory of random sets would
make  possible  statistical  modeling  and  analysis  of 
this  data  in  a  natural  way. 

Significant  computing  facilities  are  required  to 
collect,  store,  and  analyze  set-valued  data;  lack  of 
such  facilities  has  been  a  significant  obstacle  to 
empirical  work  with  such  data.  This  paper  describes 
the  design  of  a  package  of  computer  programs  to 
analyze  and  display  both  random  and  deterministic 
sets  in  the  real  plane.  Attention  is  restricted  here 
to the two-dimensional case for two reasons: the
geometrical  and  graphics  software  is  simplified,  and 
a  larger  amount  of  theory  is  available  for  two- 
dimensional  random  sets.  In  the  future  the 
algorithms  and  data  structures  will  be  generalized  to 
higher  dimensions  and  more  complex  set  structures. 

2.  Computational  Requirements  for  Analysis  of  Set¬ 
valued  Data 

The  statistical  analysis  of  real-  and  vector-valued 
data  has  a  long  history  of  collection,  plotting  and 
tabulation  which  predated  and  prepared  the  way  for 
probability  modeling  of  such  data.  In  contrast,  set¬ 
valued  data  requires  fairly  sophisticated  computing 
resources  to  collect  and  store,  and  to  do  almost 
any  non-trivial  analysis;  there  is  virtually  no  history 
of  data  analysis  on  set-valued  data  and  so  the 
theory  has  not  been  data-driven. 


For  example,  in  a  set  of  data  on  sand  grain 
shapes,  an  average  grain  profile  consists  of  about 
350  points.  Automated  scanning  equipment  is 
clearly  needed  to  digitize  any  useful  number  of 
grains.  In  addition,  a  calculation  of  the  storage 
requirements  for  a  small  sample  of  100  grains 
(assuming 32-bit floating point numbers for each
coordinate) gives 100 (grains) x 350 (points/grain) x 2
(x,y) x 4 (bytes/coordinate) ≈ 273.5 Kbytes,
approximately the capacity of a floppy disk. The
cost  of  disk  storage  on  mainframe  computers  for 
set-valued  data  sets  of  any  significant  size  would 
be  prohibitive. 
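
The storage estimate above can be reproduced in a few lines. This sketch assumes 1 Kbyte = 1024 bytes, which the text does not state explicitly; the variable names are illustrative.

```python
grains = 100
points_per_grain = 350
coords_per_point = 2        # x and y
bytes_per_coord = 4         # 32-bit floating point

total_bytes = grains * points_per_grain * coords_per_point * bytes_per_coord
kbytes = total_bytes / 1024
print(total_bytes, round(kbytes, 1))
```

Under that assumption the total is 280,000 bytes, or roughly 273.4 Kbytes, which is close to the figure quoted in the text.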

Analysis  of  a  data  set  also  requires  substantial 
cpu  time.  Algorithms  that  manipulate  geometric 
objects  such  as  the  digitized  sand  grains  must 
access  each  point  of  each  set  at  least  once  (and 
sometimes  more  often).  Finally,  analysis  of  set¬ 
valued  data  requires  graphics  input  and  output,  and 
high  resolution  graphics  devices  have  typically  not 
been  available  even  on  large  mainframe  computers 
except  at  high  costs.  Even  then,  the  use  of 
timesharing  environments  has  severely  limited  the 
performance  of  generally-available  graphics  systems. 

The lack of such equipment has meant that most
work on random sets has been restricted to
theoretical studies (e.g. (Matheron, 1975, Artstein and
Vitale, 1975, Cressie, 1979, Eddy, 1980, Trader, 1981,
Eddy, 1982)) and in turn that little intuition or
experience with real set-valued data has driven the
attempts at statistical modeling. However, the
recent advent of powerful personal workstations at
a fairly low cost (e.g. (Crecine, 1986)) has brought
adequate computing power, disk storage,
and high resolution graphics I/O together in a
compact package which can be dedicated to one
person. These developments suggest that for the
first time an empirical or heuristic approach to the
statistical analysis of set-valued data is feasible.


3.  S3:  Set  Statistical  System 

By  utilizing  a  computer  to  do  tedious  geometrical 
operations  on  sets,  a  user  can  quickly  and  easily 
carry  out  set  operations  such  as  union,  intersection 
or  Minkowski  addition,  much  as  data  analysts  in  the 
early  part  of  this  century  used  paper,  pencil,  and 
desk  calculators  to  construct  statistics  for  real- 
valued  data.  We  have  designed  a  software  system 
that  will  act  as  a  framework  for  experimenting  with 
the  analysis  of  set-valued  data.  The  central  goal  is 
to  provide  the  user  with  a  set  of  algorithms  which 
operate  on  planar  sets,  together  with  graphics 
support  and  a  flexible  user  interface  that  will  allow 
exploration  of  set-valued  data  and  construction  and 
evaluation  of  appropriate  statistics  suggested  by  the 
exploration.  These  three  components  are  discussed 
in  more  detail  in  the  next  three  sections;  this  is 
followed  by  a  section  discussing  ANDREW,  a  set  of 
software  enhancements  to  UNIX™  which  makes 
development  of  this  system  feasible  on  a 
workstation. 


3.1.  Set  Manipulation  Subroutines 

The  statistical  analysis  of  any  set  of  data 
typically  requires  the  computation  of  statistics  by 
combining  elements  of  the  data  with  appropriate 
arithmetic operations. For sets, this could involve
taking  unions,  intersections,  or  Minkowski  sums,  for 
example,  or  transforming  the  sets  in  various  ways. 
The  lowest  level  of  the  system  is  a  library  of 
programs  that  implement  operations  on  geometric 
objects  in  the  plane  (and  a  set  of  data  structures 
for  representing  those  geometric  objects).  These 
routines  will  form  the  basis  for  computing  statistics 
on set-valued data. Careful design and
implementation of these basic routines is essential,
since  their  efficiency  will  determine  the  size  of  the 
sample  which  can  be  processed  in  a  reasonable 
time. 
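
To make the set operations concrete, here is a minimal Python sketch. It is not the S3 library: it assumes planar sets are represented as finite sets of lattice points (a deliberately simple stand-in for the polygonal data structures the paper has in mind), and the function name `minkowski_sum` is hypothetical.

```python
def minkowski_sum(A, B):
    """Minkowski sum of two finite planar point sets: all pairwise vector sums."""
    return {(ax + bx, ay + by) for ax, ay in A for bx, by in B}

# Two hypothetical sets of lattice points.
square = {(x, y) for x in range(2) for y in range(2)}   # 2 x 2 block
segment = {(0, 0), (1, 0), (2, 0)}                      # horizontal segment

union = square | segment
intersection = square & segment
msum = minkowski_sum(square, segment)

print(len(union), sorted(intersection), len(msum))
```

The pairwise construction touches every point of both sets, which illustrates why the efficiency of such basic routines limits the sample sizes that can be processed.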


3.2.  Graphics  Routines 

The  second  part  of  S3  is  a  library  of  routines  to 
support  graphical  display  of  planar  sets  using  a 
high-resolution  bit-mapped  display  and  graphical  user 
input.  Computer  graphics  allows  visualization  of 
random  sets  in  the  plane  and  the  display  of 
associated  function  representations  such  as  the 
support  function  and  the  boundary  function 
(Valentine,  1964). 

The  graphics  programs  fall  into  two  categories: 

1.  Routines  which  build  and  manipulate  an 
"abstract"  display  environment.  These 
programs  maintain  lists  of  geometric 
objects  to  be  displayed  and  information 
on  the  current  real  coordinate  system  and 
its  relation  to  the  graphics  device  in  use. 
These  routines  are  device-independent. 

2.  Programs  which  are  specific  for  the 
Andrew  programming  environment  and  its 
graphics  facilities. 


3.3.  User  Interface 

The  final  goal  is  a  package,  much  like  current 
statistical  packages  for  real-  and  vector-valued  data, 
which  will  allow  easy  entry  of  data  and  convenient 
user  interaction.  It  should  also  be  extensible  (like  S 
or  ISP)  to  allow  users  to  conveniently  compute 
tentative  statistics  and  test  their  performance  on 
real  and  simulated  data. 


3.4.  Andrew 

Andrew  is  a  system  of  hardware-independent 
extensions  to  the  UNIX™  operating  system,  written 
by the Information Technology Center (Crecine, 1986,
Morris et al., 1986) at Carnegie-Mellon University,
which  runs  on  several  high  performance  workstations 
which  are  now  available.  It  allows  a  user 
application  program  to  manage  several  windows  on 
a  high-resolution  bit-mapped  display.  For  example, 
in one window a user could be viewing a realization
of  a  set-valued  random  process  in  the  plane,  another 
window  could  be  displaying  set-valued  statistics  for 
the  process  as  they  are  computed,  still  another 
window  could  be  displaying  the  process  in  some 
appropriate  function  space,  while  a  fourth  window 
could  be  used  to  issue  commands  to  affect  the 
process  or  statistical  computations.  ANDREW  also 
provides  support  for  menus  and  for  the  easy 
implementation  of  graphical  input  with  a  three-key 
mouse. Although all of the set-manipulation routines
and most of the graphics routines are written to be
independent of a particular graphics environment,
much of the utility of the system will come from
the features of the Andrew software. In particular,
the  hardware-independent  nature  of  Andrew  will 
allow  use  of  these  programs  on  different  machines. 
Finally,  network  support  in  ANDREW  provides  access 
to  large  amounts  of  disk  storage  from  an  individual 
workstation,  freeing  it  from  the  necessity  of  storing 
all  of  the  needed  files  on  a  local  disk. 


4.  Current  Status  and  Future  Plans 

The  geometrical  routines  are  well-advanced, 
consisting  of  about  100  routines  that  range  from 
simple  coordinate  transformations  (e.g.  cartesian  to 
polar)  to  generating  pseudo-random  star-shaped  sets. 
Data-structures  and  conventions  are  fairly  stable  and 
well-defined. Programming effort continues in
extending  the  functions  available  and  programming 
more  efficient  algorithms  based  on  research  results 
in  computational  geometry  (e.g.  (Preparata  and 
Shamos,  1985)). 

The  graphics  routines  have  a  core  which  is  used 
to  test  the  geometrical  routines,  but  are  under  active 
development  as  we  experiment  with  various  ways  of 
using  the  graphical  interface  facilities  provided  by 
ANDREW.  In  particular,  we  are  planning  to 
implement  multiple  windows  in  our  programs  to 
allow  a  user  to  view  the  same  data  set 
simultaneously  in  several  different  ways. 

The  development  of  the  user  interface  is  still  in 
the  planning  stage,  since  we  are  just  beginning  to 
get  experience  with  exploring  set-valued  data  in  a 
workstation-type  environment.  One  option  under 
investigation  is  to  attempt  to  integrate  the  routines 
more  closely  with  an  existing  extensible  package 
such as S (Becker and Chambers, 1984) to take
advantage of its user interface (which already
accommodates graphical I/O to some extent).

Finally, we have at hand some samples of set-valued
data (such as the sand grain data mentioned
above)  which  we  intend  to  explore  from  this  new 
perspective  using  these  new  tools. 

The  accompanying  figure  is  a  screen  dump  of 
the  Andrew  workstation  "pitneyfork"  running  a 
prototype  of  S3.  The  screen  displays  a  plot  of  50 
elements from a set-valued random process of star-shaped
sets, plus a plot of the distribution of the
number  of  vertices  per  set  using  S.  In  addition,  the 
text  editor  EMACS  is  being  used  to  modify  the  main 
routine  of  S3  and  the  window  labeled  console  is 
monitoring  the  performance  of  the  workstation. 

Translating Fortran Programs to C
Should  You  Do  It? 


David  Gray 
Statistics  Department 
University of Kentucky


ABSTRACT 

This paper grew out of my experiences in converting some nonlinear minimization
subroutines written in Fortran to C. I wanted to port these routines to a microcomputer
on which a C compiler was available but no Fortran compiler. In the course of
this I tried to develop some automatic translation tools. I discuss some of
the issues involved in this process, in the hope that this will help others decide
whether they wish to do the same.


1. Why Convert a Fortran Program to C?

The  programming  language  C  is  very 
popular and its popularity seems to be increasing
at a rapid rate. This probably stems as
much as anything from the thousands of computer
science students trained in the UNIX*
environment.  Many  of  the  large  software  houses 
are  using  C  and  others  are  converting  to  it. 
Here  is  a  short  list  of  reasons  why  people  are 
converting  their  old  Fortran  programs  to  C. 

1)  It’s  possible  that  you  may  only  have  a  C 
compiler  available  on  your  microcomputer. 

2) A C compiler is typically the first language
tool available on a new machine (along with
Basic). This is usually because the
software firm that wrote the operating system
software did it in C. For example,
the Atari 520ST had a C compiler available
at its introduction, but after
almost a year on the market no Fortran
compiler is available. A firm that wants
to get a jump on its competition or needs
to port software to a new machine will
have an advantage if its software is written
in C.

3) Most large software projects are being
written in C. There are compiler families
that allow Fortran and C code to be
linked together. However, this can be
tricky business, and if all the code is written
in one language things will go
more smoothly.

* UNIX is a trademark of AT&T.

4)  Many  new  programmers  know  C  from 
university  experience  with  UNIX.  Rather 
than  training  them  in  a  new  language, 
there  may  be  an  advantage  in  using  them 
in  their  ’natural’  environment.  I  have  been 
told  stories  of  freshly  minted  programmers 
balking  at  programming  in  anything  but 
C. Of course this is a management problem,
and  not  a  pure  programming  issue,  but  it 
is  important. 

5) C libraries usually offer interfaces to
operating system and hardware services
such as graphics, memory management, etc.

6)  And  of  course  there’s  the  bandwagon 
effect:  everybody  else  is  doing  it. 

The  advantages  of  C  include: 

1) Speed. In some cases, it has been claimed
that C is only 2 to 3 times slower than
programming in assembly language. However,
this clearly depends on the quality
of the code generated by a particular C
compiler. In the microcomputer world
compilers tend to produce flat-footed,
unimaginative code. Also, as programs
grow larger, it is not clear that the
claimed C to assembly language speed
ratio is maintained.

2) Portability. Portable programs are possible,
but access to operating system
resources usually results in non-portable
programs. The use of conditional compilation
can help, but writing portable code is
an  art  that  few  of  us  are  good  at.  It  is 
also  debatable  whether  portability  is 
always  desirable. 

3)  Flexibility.  It  seems  to  be  true  that  if  it 
can be done, you can do it in C. The richness
of data structures may be the most
important  reason. 

The  advantages  of  Fortran  include: 

1)  Speed.  Fortran  compilers  have  a  general 
reputation  for  being  fast  with  respect  to 
execution,  especially  for  numerical  work. 
Again, as with C, in the micro world Fortran
compilers may not deserve that reputation.

2)  Portability.  Fortran  is  extremely  portable 
if  you  stay  away  from  compiler  specific 
extensions. 

3) Flexibility. This is Fortran’s real weakness.
Fortran is poor at handling character
data, has limited access to OS
resources and offers very limited data structures.

So  it  appears  the  real  difference  is  the 
flexibility of C over Fortran. I believe most professional
programmers will agree that this is one
of  the  chief  reasons  if  not  the  main  reason  for 
using  C. 

2. What makes the translation from Fortran
to C difficult?

If it were trivial to convert Fortran programs
to C, it would all have been done by now.
Unfortunately,  it  isn’t  all  sweetness  and  light. 
There  are  many  problems  in  going  from  Fortran 
to  C,  some  easy  and  others  subtle  and  difficult. 
What  follows  is  a  by  no  means  complete  list  of 
some  of  the  problems  involved. 

1) Fortran is context sensitive. If you come
across a ’(’ in a Fortran program you don’t
know whether you’re dealing with a function
or an array without extra information.
It’s more work to keep track of the
information needed to make these decisions.
Another problem is that blanks are
not significant in Fortran. It’s possible to
write totally unreadable Fortran programs.
Fortunately, people don’t write
such programs, at least not intentionally.

2)  Fortran  is  column  major  and  C  is  row 
major  with  respect  to  arrays.  Numerical 
algorithms  tend  to  take  advantage  of 
Fortran’s  column  ordering.  If  such  an 
algorithm were literally translated into C
on a machine with virtual memory, substantial
performance degradation due to page
faults could occur.

3)  Not  all  functions  available  in  Fortran  are 
available in C. This is a fairly straightforward
problem to solve but it does entail
extra  work  and  testing. 

4)  Fortran  makes  no  distinctions  between 
parameters  and  variables.  That  is,  the 
variable declarations in a subroutine don’t
give  any  information  whether  a  value  or 
an  address  is  being  passed.  For  functions 
and  arrays  it  is  clear  that  they  are 
addresses.  This  is  a  difficult  problem. 

5)  Not  all  C  and  Fortran  data  types  match 
up.  For  example,  there  are  no  logical  or 
complex  types  in  C.  They  can  usually  be 
simulated  however. 

6) C promotes single precision floating point
variables to double precision when performing
arithmetic operations. This is an
important problem for the numerical
analyst because in many problems most of
the computation is done in single precision
and only such things as residuals are kept
in double precision. The speed advantage
of single precision is lost. Some C compilers
optionally do not promote and the
ANSI C draft may make this standard.

7) C does not guarantee the order of evaluation
of terms in an arithmetic expression.
For  example, 

(a  +  b)  +  (c  +  d) 

could  be  translated  by  the  compiler  to 
(a  +  d)  +  (b  +  c) 

If  the  order  of  evaluation  is  important,  e.g. 
to  avoid  overflow  or  in  a  calculation  where 
small  elements  must  be  added  before  large 
elements  to  avoid  cancellation,  accuracy 
may  be  lost.  Order  would  have  to  be 
forced  by  using  temporary  variables  to 
hold  intermediate  quantities. 

8) Arrays in C start with index 0 while in
Fortran they start with 1. This seems simple
to fix, but with the wild subscripting
schemes  used  in  some  programs,  it  can  be 
confusing. 

9) In Fortran subroutines, arrays may be
given dimensions passed to the subroutine.
In C, arrays passed to a function must have
fixed dimensions, except for the first dimension.
One way around this is to write your own matrix
allocation subroutine and call it with the
passed dimensions. But then references
to elements of the allocated array can
become messy.

Some  of  the  problems  in  this  list  are  easily 
amenable  to  automatic  translation,  at  least  of 
the  kind  I  mean  where  you  are  not  going  to 
spend  a  year  writing  your  translator.  Some  of 
them appear to require an in-depth understanding
of the original Fortran code and will have to
be  resolved  manually. 

3.  Writing  your  own  translation  tools. 

In  attempting  to  write  some  tools,  I 
wanted  to  take  advantage  of  existing  tools  to 
make  my  life  easier.  Fortunately,  I  had  access 
to  a  Vax  11/750  running  4.2BSD  UNIX.  UNIX 
contains  many  tools  for  character  and  string 
manipulation, which is what much of the translation
process is. It also has a little known utility, struct,
which is of great value. Struct takes a Fortran
program and translates it into a Ratfor program.
Ratfor is a dialect of Fortran with many
C-like constructs. After using struct, further
processing was done with awk, a pattern matching
and string substitution language that can be
programmed similarly to C (I have been told
that sed, the stream editor, can do some of the
things that I was using awk for, and much more
quickly). I also wrote some C programs in those
cases where awk wasn’t powerful and/or fast
enough.

The  awk  and  C  programs  did  such  things 
as: 

1)  Fix  goto  labels 

2) Change Ratfor switch statements to C
switch statements

3) Change do statements to for statements

4) Change parentheses to brackets. This
required knowing in advance the function
and array names. I used grep for this purpose.

5) Add semicolons, fix up comments, etc.

None of the above touch on the more
difficult problems mentioned in the previous section,
and with good reason: I didn’t want to
spend a year writing my own translator. Also,
it is important to note that I’ve relied on the
good coding practices of the people who wrote
the Fortran programs. The type of code I
deal with is well written and
with care. This makes it easier to write translation
tools and certainly makes my simplistic
tools work.

A neater and more powerful way to
accomplish the above and even more is to use
the UNIX utilities yacc and lex. Yacc is a
parser generator that has been used in the construction
of compilers and other software. Lex
is a lexical analyzer generator. These tools along with
struct and awk could be used to build a powerful
translator. However, I wouldn’t go so far as
to try to write a Fortran to C compiler. Fortran
must be one of the hardest languages to
write a compiler for and you’ll easily spend
your year (or more) doing this.

There are commercial programs and services
available for Fortran to C source translation.
They tend to be expensive, and a recent
review of one of these programs reported that
the  resulting  C  code  was  a  literal  translation  of 
the  Fortran  source  code  and  that  the  translator 
program  broke  on  large  programs.  It  appears 
that  the  people  who  do  this  for  a  living  have 
problems  too. 

4.  An  example  of  simple  translation. 

In  this  example,  a  Fortran  fragment  will 
be translated into C code. The original fragment
was: 

      DO 20 I=1,L
      II=L-I+1
   20 S(I)=R(II,L)

After  applying  struct  we  get, 

do i = 1,l {
    ii=l-i+1
    s(i)=r(ii,l)
}

After  running  an  awk  program  to  convert 
do  statements  to  for  statements,  we  get 

for(i=1-1;i<l;i++) {
    ii=l-i+1
    s(i)=r(ii,l)
}

Note that the ’1-1’ in the for statement is
obnoxious, but I am relying on the
compiler to fold the term into ’0’. Next, using
tools to change parentheses into brackets and
add semicolons, the final version of the fragment
is,

for(i=1-1;i<l;i++) {
    ii=l-i+1;
    s[i]=r[ii][l];
}

Rather than run these programs individually,
they can be put into a shell script and executed
as a single command.

5. What Do You Have When You’re Done?

Typically,  what  you  have  when  you’re 
done  is  a  Fortran  program  written  in  C.  This  is 
especially  true  when  the  process  is  highly 
automated.  None  of  the  unique  C  constructs 
will  be  used.  In  some  cases  this  is  all  you  can 
ask  for  as  some  programs  can  really  only  be 
written  one  way,  Fortran  or  C.  But  you  may  be 
disappointed  as  the  resulting  code  is  not  very 
exciting.  You  get  what  you  pay  for. 

At worst, you may have an inefficient program
that doesn’t produce the same results as
the  original.  This  too  may  be  a  result  of  highly 
automated  translation  or  mistakes  made  in 
manual  translation. 

The  moral  of  the  story  is  to  TEST  the 
resulting  C  program.  A  comment  was  made  at 
the  conference  that  one  reason  NOT  to 
translate Fortran programs to C is that the Fortran
routines have withstood the test of time. A
new  C  version  would  also  have  to  undergo  the 
same  tests. 

To improve performance or accuracy, or to take
advantage of C features, fine tuning by hand will
be necessary. This requires that

1) You know Fortran.

2) You know C.

3) You understand the program or underlying
algorithm.


You  might  decide  it  is  better  to  start  from 
scratch  in  C. 

6.  Conclusions 

After first being very enthusiastic about the
idea of automatic translation of Fortran to C, I
am now more cautious. You can translate some
Fortran programs or subroutines into good quality
C programs or functions. What I’ve learned
is  that  you  can’t  be  too  greedy.  My  conclusions 
come  down  to  the  following: 

1) It is possible to automate much of the
tedium out of the Fortran to C translation
process without too much effort.
With  a  lot  of  effort  you  can  just  about 
automate  the  whole  process. 

2)  You  should  have  a  good  reason  to  convert 
a Fortran program to C. Fortran will
often outperform C in execution time and
Fortran  is  very  portable.  And  there  are 
compiler  families  that  allow  Fortran,  C 
and  other  languages  to  be  mixed. 

3) You cannot treat this as a black box, especially
with numerically sensitive routines.
The resulting C program may perform
unacceptably and you will have to fix it
manually. I would never take a large Fortran
program, automatically translate it
and  assume  it  will  perform  correctly 
without  testing. 

4)  The  typical  size  of  routine  that  I  translate 
semi-automatically  is  100  to  300  lines. 
This  is  a  size  that  allows  me  to  understand 
what  is  going  on  inside  the  program  and 
also feel confident that I can do the necessary
manual labor involved. Fortunately,
this  is  a  ’standard’  size  routine  that  is 
found  in  numerical  work. 

UPPER AND LOWER PROBABILITY:

A  GENERAL  FRAMEWORK  FOR  MODELING  UNCERTAINTY 

Yves  L.  Grize,  AT&T  Bell  Laboratories,  Holmdel,  NJ  07733 


Abstract 

This  paper  reviews  the  theory  of  upper  and  lower 
probability,  also  called  interval-valued  probability. 
Interest  in  this  theory  has  recently  been  stimulated 
by  the  attempts  to  apply  the  theory  of  belief 
functions  to  artificial  intelligence,  especially  expert 
systems  designs.  Upper  and  lower  probabilities 
provide  an  attractive  general  framework  for 
modeling  uncertainty  because  of  their  large  scope  of 
interpretations.  They  include  not  only  conventional 
probabilities,  belief  functions  and  envelopes  of 
measures  but  also  new  uncertainty  functions  that 
cannot  be  related  to  conventional  probabilities.  The 
problem  of  the  numerical  complexity  of  the  theory  is 
discussed.  A  family  of  lower  probabilities  that  are 
easy  to  use  and  should  be  large  enough  for  most 
practical  applications  is  described.  Some  limitations 
of  the  theory  will  also  be  discussed. 

I.  INTRODUCTION 

The need for mathematical models for reasoning
under  uncertainty  has  been  emphasized  in  the  recent 
literature  in  artificial  intelligence  and  related  fields 
(e.g., see IJCAI [1985]). Useful abstract models for
reasoning  under  uncertainty,  namely  probabilistic 
reasoning  models,  should  be  based  on  a  concept  of 
probability  that  is  supported  by  a  mathematical 
structure. 

In  this  paper  we  review  the  mathematical 
structure  of  upper  and  lower  (U/L)  probability  also 
called  interval-valued  probability  and  describe 
different  uncertainty  models  that  can  be  constructed 
from  this  structure.  Because  of  their  large  scope  of 
interpretations  U/L  probabilities  provide  a  general 
framework  for  modeling  uncertainty.  They  include 
not  only  conventional  probabilities,  belief  functions 
and  envelopes  of  measures  but  also  new  uncertainty 
functions  that  cannot  be  related  to  conventional 
probabilities. 

Interest  in  U/L  probabilities  has  recently  been 
stimulated  by  the  attempts  to  apply  the  theory  of 
belief  functions  to  the  design  of  expert  systems. 

II.  THE  THEORY  OF  UPPER  AND 
LOWER  PROBABILITY 


II.1 The Basic Theory:

Throughout the paper $\Omega$ denotes a finite set.
Two numbers, $\underline{P}(A)$ and $\overline{P}(A)$, called respectively
the lower and upper probability of $A$, are assigned to
each subset $A$ of $\Omega$. These two set functions must
satisfy the following axioms (e.g., see Good [1962]):

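• Axiom 1 (Normalization): $\underline{P}(\Omega) = 1$

• Axiom 2 (Nonnegativity): $(\forall A)\ \underline{P}(A) \ge 0$

• Axiom 3 (Conjugacy): $(\forall A)\ \underline{P}(A) + \overline{P}(\bar{A}) = 1$

($\bar{A}$ denotes the complement in $\Omega$ of $A$)

• Axiom 4 (Sub- and Superadditivity):

$(\forall A, B : A \cap B = \emptyset)$

$\overline{P}(A) + \overline{P}(B) \ge \overline{P}(A \cup B)$ (sub)

$\underline{P}(A) + \underline{P}(B) \le \underline{P}(A \cup B)$ (super)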

Elementary  consequences  of  these  axioms  include 
the  following: 

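1. $(\forall A)\ \underline{P}(\emptyset) = \overline{P}(\emptyset) = 0 \le \underline{P}(A) \le \overline{P}(A) \le 1 = \underline{P}(\Omega) = \overline{P}(\Omega)$

2. $(\forall A, B)\ A \subset B \Rightarrow \underline{P}(A) \le \underline{P}(B)$ and $\overline{P}(A) \le \overline{P}(B)$

Observe that if $\underline{P} = \overline{P} = P$ then $P$ is a (finitely
additive) probability measure.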

Because of Axiom 3 one set function on $2^\Omega$
completely determines the other. Therefore, without
loss of generality, the entire theory can be expressed
in terms of only one of them. From now on our
discussion will be phrased in terms of the lower
probability only, as is customary in the literature.

A  lower  probability  can  be  defined  independently 
of  its  associated  upper  probability  as  follows: 

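A lower probability on $2^\Omega$ is a normalized, nonnegative
set function $\underline{P}$ such that $\underline{P}(\emptyset) = 0$ and
$\underline{P}(A) + \underline{P}(B) \le \underline{P}(A \cap B) + \underline{P}(A \cup B)$ for all pairs
$(A, B)$ of sets such that $A \cap B = \emptyset$ or $A \cup B = \Omega$.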
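
To see why this single condition captures Axiom 4, one can substitute the conjugacy relation of Axiom 3 into the subadditivity of the upper probability. The short derivation below is a sketch and is not taken from the original text; it writes $\underline{P}$ for the lower probability, $\overline{P}$ for the upper probability and $\bar{A}$ for the complement of $A$.

```latex
\begin{align*}
\overline{P}(A) + \overline{P}(B) &\ge \overline{P}(A \cup B),
  \qquad A \cap B = \emptyset \\
\bigl(1 - \underline{P}(\bar{A})\bigr) + \bigl(1 - \underline{P}(\bar{B})\bigr)
  &\ge 1 - \underline{P}(\bar{A} \cap \bar{B}) \\
\underline{P}(\bar{A}) + \underline{P}(\bar{B})
  &\le 1 + \underline{P}(\bar{A} \cap \bar{B})
   = \underline{P}(\bar{A} \cup \bar{B}) + \underline{P}(\bar{A} \cap \bar{B})
\end{align*}
```

The last equality uses $\bar{A} \cup \bar{B} = \Omega$, which holds because $A$ and $B$ are disjoint, together with Axiom 1. This is the case "union equal to $\Omega$" of the condition. The case "empty intersection" is superadditivity itself, since the lower probability of the empty set is 0.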

II.2 Motivation for the Axioms:

A natural motivation for the axioms 1-4 comes
from the behavior of the relative frequency $f_n$ of
occurrence of an event $E$ in a sequence of $n$
independent repetitions of an experiment (Walley &
Fine [1982]). Since $\{f_n\}$ is a bounded sequence of
real numbers it always has an inferior and a superior
limit. One then defines $\underline{P}(E) = \liminf f_n$ and $\overline{P}(E)
= \limsup f_n$. The assumption about the convergence
of  the  relative  frequency  to  the  probability,  needed 
to  justify  the  axioms  of  conventional  probability,  is 
no  longer  necessary. 

II.3 A Simple Example:

To  illustrate  how  a  lower  probability  could  be 
used  to  model  uncertainty  let  us  consider  the 
following  simplified  situation  of  a  medical  diagnosis: 
Let $\Omega = \{a,b,c\}$ where a, b and c are three
symptoms:  a:  to  have  a  runny  nose,  b:  to  have 
irritated  eyes,  c:  to  sneeze  once  in  a  while.  Further, 
suppose  that  an  expert  tells  us  that: 

-  one  symptom  alone  does  not  indicate  an  allergy, 

-  the  three  symptoms  together  surely  indicate  an 
allergy, 

- two symptoms indicate a middle state of indecision
about the presence of an allergy (i.e. in common
language there is a 50% "chance" of an allergy).

How can we model the uncertainty in an allergy
diagnosis based on an observation $A$ in $\Omega$?

A natural answer is the set function $\underline{P}$ defined by:

$\underline{P}(\Omega) = 1$,

$\underline{P}(\{a,b\}) = \underline{P}(\{b,c\}) = \underline{P}(\{a,c\}) = \frac{1}{2}$,

$\underline{P}(\{a\}) = \underline{P}(\{b\}) = \underline{P}(\{c\}) = \underline{P}(\emptyset) = 0$.

Observe that $\underline{P}$ is not a probability (in fact it is not
even a belief function, as defined in III.1.f).
However, it is easy to see that $\underline{P}$ is a lower
probability.

III.  UNCERTAINTY  MODELS  BASED  UPON 
LOWER  PROBABILITIES: 

The  mathematical  structure  of  U/L  probabilities 
can  be  used  as  a  basis  for  various  uncertainty  models 
each  one  corresponding  to  a  different  type  of  lower 
probability.  We  first  define  these  different  types  of 
lower  probabilities  and  then  discuss  the 
corresponding  models  and  mention  some  of  their 
applications. 

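III.1 Classification of Lower Probabilities:

Definition: Let $\underline{P}$ be a lower probability on $2^\Omega$.

a. A probability measure $\mu$ dominates $\underline{P}$ if $(\forall A)$
$\mu(A) \ge \underline{P}(A)$. It follows that $\mu(A) \le \overline{P}(A)$.

b. The class $M_{\underline{P}}$ of all probabilities that dominate
$\underline{P}$ is called the class of dominating probabilities.

c. If $M_{\underline{P}}$ is empty, $\underline{P}$ is called undominated,
otherwise it is dominated.

d. $\underline{P}$ is a lower envelope if

$(\forall A)\ \underline{P}(A) = \inf \{\mu(A) : \mu \in M_{\underline{P}}\}$.

e. A lower probability $\underline{P}$ is monotone of order $k$ if:
$(\forall A_1, \ldots, A_k)$

$\underline{P}\left(\bigcup_{i=1}^{k} A_i\right) \ge \sum_{\emptyset \ne I \subset \{1,\ldots,k\}} (-1)^{|I|+1}\, \underline{P}\left(\bigcap_{i \in I} A_i\right)$.

In particular $\underline{P}$ is monotone of order 2 (or 2-monotone)
if:

$(\forall A, B)\ \underline{P}(A) + \underline{P}(B) \le \underline{P}(A \cup B) + \underline{P}(A \cap B)$.

f. $\underline{P}$ is a belief function if it is monotone of order $k$
for all $k$. Every set function $\underline{P}$ on $2^\Omega$ can be
written as $(\forall A)\ \underline{P}(A) = \sum_{B \subset A} m(B)$ for some
function $m$. It turns out that $\underline{P}$ is a belief
function if and only if $m$ is non-negative with
$m(\emptyset) = 0$ and $\sum_{B \subset \Omega} m(B) = 1$. $m$ is called the
basic probability assignment of the belief
function $\underline{P}$.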

Denote by $P$, $B$, $M_2$, $LE$, $D$, $U$ and $LP$ the classes
of probabilities, belief functions, 2-monotone lower
probabilities, lower envelopes, dominated lower
probabilities, undominated lower probabilities and
lower probabilities on $2^\Omega$. It has been shown (e.g.,
see Walley & Fine [1982]) that:

$P \subset B \subset M_2 \subset LE \subset D \subset D \cup U = LP$

and that if $|\Omega| \ge 7$ all these inclusions are strict.

We  now  discuss  the  uncertainty  models  resulting 
from  this  hierarchy  of  lower  probabilities. 

III.2 Probability-Based Models (P-Models):

We  have  already  pointed  out  that  probabilities 
are  a  special  case  of  lower  probabilities.  P-models  are 
well-known  and  will  not  be  further  discussed. 

III.3 Belief Function-Based Models (B-Models):

B-models  are  the  lower  probability  based  models 
that  have  received  the  most  attention.  They  arose 
from  the  work  of  Dempster  [1967]  on  multivalued 
mapping  and  were  later  extended  by  Shafer  [1976] 
(see  also  Shafer  [1982a]). 

Belief  functions  are  interpreted  through  their 
basic  probability  assignment  function  m:  in  light  of  a 
piece  of  evidence,  m(A)  is  that  portion  of  a  person's 
total belief (of value 1) exactly committed to A and
to  none  of  the  proper  subsets  of  A  ("intrinsic" 
belief).  A  reason  for  the  popularity  of  B-models  is 
that  they  can  be  combined  together  through  their 
m-functions  as  different  pieces  of  evidence  are 
collected  using  the  so-called  "Dempster’s  rule  of 
combination"  (see  Shafer  [1976]).  B-models  have 
been  used  in  a  variety  of  fields  such  as  psychology 
(e.g.,  Krantz  and  Miyamoto  [1983]),  statistics  (e.g., 
Shafer  [1982b]),  computer  vision,  risk  assessment 
and  medical  expert  systems  (e.g.,  Gordon  & 
Shortliffe  [1984]). 

III.4 Lower Envelope-Based Models (LE-Models):

LE-models  arise  whenever  the  uncertainty  is 
conveniently  described  by  a  class  of  probability 
measures $M$. Indeed any such $M$ induces a lower
envelope by $\underline{P}(A) = \inf \{\mu(A) : \mu \in M\}$. In general
the set $M_{\underline{P}}$ of dominating measures will be larger
than the class $M$ that induces the lower envelope.
Some  applications  of  LE-models  are  mentioned: 

1.  In  the  theory  of  robust  statistics  LE-models 
have  been  used  to  describe  neighborhoods  of 
probability  distributions  (Huber  [1981]). 

2.  When  expert  opinions  are  represented  by 
probabilities,  lower  envelopes  provide  a  simple 
way  to  aggregate  these  opinions  into  one  set 
function  (Walley  [1982]).  Thorp  et  al.  [1982] 
give an example of an LE-model used to
forecast  production  cost  in  an  electric  utility. 

3.  Finally  a  personalistic  account  of  uncertainty, 
based  on  a  notion  of  "coherency",  has  been 
developed  using  lower  envelopes,  in  a  similar 
way  as  it  is  done  using  probabilities  (see  Walley 
[1981]  and  the  references  therein).  This 
approach allows one to model the inherent
imprecision  in  a  person’s  beliefs. 

III.5 Dominated Lower Probability-Based Models
(D-models): 

All  the  models  discussed  so  far  are  based  on 
dominated  lower  probabilities  but  the  class  of  D- 
models  itself  has  not  yet  been  studied  specifically.  It 
is  difficult  to  interpret  dominated  lower  probabilities 
that  are  not  lower  envelopes,  except  as  a  vague 
description of an underlying probability measure $\mu$
such that $\underline{P} \le \mu \le \overline{P}$.


III.6 Undominated Lower Probability-Based
Models  (U-models): 

Undominated  lower  probabilities  provide  a 
completely  new  framework  for  modeling  uncertainty 
since  they  cannot  be  related  to  usual  probability 
measures.  Grize  and  Fine  [1986/7]  have  shown  that 
U-models  can  be  constructed  to  describe  stationary 
processes  with  bounded  and  divergent  (i.e. 
fluctuating)  time  averages,  while  the  modeling  of 
such  processes  is  impossible  in  standard  probability 
theory  (contradiction  with  the  ergodic  theorems). 
The  fact  that  processes  with  the  above-mentioned 
properties  seem  to  exist,  as  data  on  the  frequency 
fluctuations  of  quartz  crystal  oscillators  show  (see 
Grize  [1984]  for  details),  strongly  motivates  the 
study  of  U-models. 

The  use  of  U-models  remains  so  far  conceptual. 
The  interpretation  of  undominated  lower 
probabilities  in  terms  of  observable  data  is  still  an 
open  problem. 

IV.  THE  COMPLEXITY  OF  THE  THEORY  - 
PRACTICAL  CONSIDERATIONS: 

Let $|\Omega| = n$. To define a probability $\mu$ on $\Omega$ it
suffices to specify the values of $\mu$ on the $n$ atoms of
$\Omega$, but for a lower probability $\underline{P}$ the values of $\underline{P}$ must
be given for each of the $2^n$ subsets of $\Omega$.

The classes $P$ and $LP$ are closed and bounded
convex polyhedra in the $2^n$-dimensional space $\mathbb{R}^{2^n}$.
$P$ has $n$ extreme points. An idea of the complexity or
richness  of  the  structure  of  lower  probabilities  is 
gained  by  examining  the  extreme  points  of  LP.  The 
number  of  extreme  lower  probabilities  grows  very 
rapidly  with  n  and  already  exceeds  10  million  when 
n=10  (see  Grize  [1984]  for  details).  If  n  =  3,  there 
are  8  extreme  lower  probabilities:  7  belief  functions 
and the lower probability of paragraph II.3.

To have a useful theory, a way must be found to
avoid having to define $\underline{P}$ for every set. A large class
of lower probabilities that includes all the types
discussed above and that is easy to use has been
proposed in Grize [1984]. Such lower probabilities
are defined by way of a family $G$ of sets with the
property that any collection of $2m-2$ elements of $G$
has a non-empty intersection, where $m$ is a given
integer greater than 1. For a set $A$, $\underline{P}(A)$ is
determined by the smallest number of sets in $G$
whose intersection lies in $A$. More precisely:


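$$
(\forall A \subset \Omega)\quad \underline{P}(A) =
\begin{cases}
1 & \text{if } A = \Omega \\
1 - \frac{1}{m} & \text{if } A \ne \Omega \text{ and } (\exists B \in G_1)\ B \subset A \\
1 - \frac{2}{m} & \text{if } (\forall B \in G_1)\ B \not\subset A \text{ and } (\exists B \in G_2)\ B \subset A \\
\ \vdots \\
\frac{1}{m} & \text{if } (\forall B \in G_{m-2})\ B \not\subset A \text{ and } (\exists B \in G_{m-1})\ B \subset A \\
0 & \text{otherwise}
\end{cases}
$$

where: $G_1 = G$ and
$G_k = \{B \subset \Omega : (\exists B_1, B_2, \ldots, B_k \in G)\ B = B_1 \cap B_2 \cap \ldots \cap B_k\}$.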

It is easy to check that $\underline{P}$ is a lower probability.
This class of lower probabilities should be large
enough for most practical applications. The lower
probability of section II.3 is an example of a lower
probability defined in this fashion with G = { {a,b},
{b,c}, {a,c} } and m = 2.
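The construction can be tried out numerically. The following is a minimal Python sketch, not code from Grize [1984]: the function names and the brute-force search over intersections are assumptions made for illustration. It evaluates the lower probability for the family G = { {a,b}, {b,c}, {a,c} } with m = 2 and checks superadditivity on disjoint sets.

```python
from itertools import combinations

# Illustrative data: the three-point example with m = 2.
OMEGA = frozenset({"a", "b", "c"})
G = [frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"a", "c"})]


def lower_probability(A, omega, family, m):
    """Return 1 - k/m, where k is the smallest number of sets of the family
    (1 <= k <= m-1) whose intersection lies in A; 1 if A is omega, else 0."""
    A = frozenset(A)
    if A == omega:
        return 1.0
    for k in range(1, m):
        for sets in combinations(family, k):
            if frozenset.intersection(*sets) <= A:
                return 1.0 - k / m
    return 0.0


def subsets(omega):
    """All subsets of omega as frozensets."""
    items = sorted(omega)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_superadditive(omega, family, m):
    """Check P(A u B) >= P(A) + P(B) for every pair of disjoint sets."""
    for A in subsets(omega):
        for B in subsets(omega):
            if A & B:
                continue
            lhs = lower_probability(A | B, omega, family, m)
            rhs = (lower_probability(A, omega, family, m)
                   + lower_probability(B, omega, family, m))
            if lhs < rhs - 1e-12:
                return False
    return True


if __name__ == "__main__":
    for A in subsets(OMEGA):
        print(sorted(A), lower_probability(A, OMEGA, G, 2))
    print("superadditive:", is_superadditive(OMEGA, G, 2))
```

Only the family G and the integer m need to be stored, rather than one value for each of the $2^n$ subsets.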


V.  CONDITIONAL  LOWER  PROBABILITIES: 


A satisfactory answer to the question of defining
conditional lower probabilities is yet to be found. It
is beyond the scope of this paper to present a full
discussion of this issue and we shall limit ourselves to
briefly mentioning some of the various forms of
conditioning that have been proposed so far (see
Walley [1981] for more details):


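• For probabilities define:

$$ P(A|B) = \frac{P(A \cap B)}{P(B)} $$

• For belief functions the conditioning is expressed
in terms of the upper probability (Shafer [1976]):

$$ \overline{P}(A|B) = \frac{\overline{P}(A \cap B)}{\overline{P}(B)} $$

or:

$$ \underline{P}(A|B) = \frac{\underline{P}(A \cup \bar{B}) - \underline{P}(\bar{B})}{1 - \underline{P}(\bar{B})} $$

• For a lower envelope $\underline{P}$ given by
$\underline{P}(A) = \inf\{\mu(A) : \mu \in M_{\underline{P}}\}$ define (e.g., Walley [1981]):

$$ \underline{P}(A|B) = \inf\{\mu(A|B) : \mu \in M_{\underline{P}}\}. $$

If $\underline{P}$ is 2-monotone, $\underline{P}(A|B)$ can be written as:

$$ \underline{P}(A|B) = \frac{\underline{P}(A \cap B)}{\underline{P}(A \cap B) + \overline{P}(B \cap \bar{A})}. $$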
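To make the envelope form of conditioning concrete, here is a small Python sketch under stated assumptions: the envelope is taken over a finite, hand-picked list of probability measures (the two measures below are invented for illustration), and the helper names are hypothetical. For any such set the closed-form expression is a lower bound on the conditional envelope, because $\mu(A \cap B) / (\mu(A \cap B) + \mu(B \cap \bar{A}))$ increases in its first term and decreases in its second. The 2-monotone case discussed above is where the two are stated to coincide.

```python
def prob(mu, event):
    """Probability of an event (a set of atoms) under the measure mu (a dict)."""
    return sum(mu[w] for w in event)


def lower(measures, event):
    """Lower envelope: infimum of mu(event) over the given measures."""
    return min(prob(mu, event) for mu in measures)


def upper(measures, event):
    """Upper envelope: supremum of mu(event) over the given measures."""
    return max(prob(mu, event) for mu in measures)


def conditional_lower_envelope(measures, A, B):
    """inf over the measures of mu(A | B); assumes mu(B) > 0 for every mu."""
    A, B = set(A), set(B)
    return min(prob(mu, A & B) / prob(mu, B) for mu in measures)


def closed_form(measures, A, B):
    """lower(A n B) / (lower(A n B) + upper(B n not-A))."""
    A, B = set(A), set(B)
    num = lower(measures, A & B)
    return num / (num + upper(measures, B - A))


# Two illustrative measures on {a, b, c} (assumed data).
MEASURES = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.2, "b": 0.5, "c": 0.3},
]

if __name__ == "__main__":
    A, B = {"a"}, {"a", "b"}
    print("envelope   :", conditional_lower_envelope(MEASURES, A, B))
    print("closed form:", closed_form(MEASURES, A, B))
```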


conditional  lower  probabilities  is  disturbing.  There  is 
little  doubt  that  this  question  needs  to  be  solved  for 
the  theory  to  be  successful  in  areas  such  as  expert 
systems  or  artificial  intelligence.  In  our  opinion  the 
issue  of  conditioning  is  the  major  limitation  of  the 
theory  of  lower  probability  today. 

VI.  CONCLUSION: 

•  The  mathematical  structure  of  upper  and  lower 
probabilities  unifies  various  uncertainty  models 
and  is  an  elegant  generalization  of  the  classical 
theory  of  probability, 

•  Upper  and  lower  probabilities  provide  a  general 
framework  for  modeling  uncertainty  where  the 
uncertainty  described  by  conventional  probability 
is only a degenerate case ($\underline{P} = \overline{P}$),

•  Upper  and  lower  probabilities  have  solid 
mathematical  foundations  hence  are  adequate  for 
rigorous  developments, 

•  Although  upper  and  lower  probabilities  are,  in 
general,  difficult  to  specify  numerically,  a  class 
that  is  easy  to  use  and  large  enough  for  most 
practical  purposes  has  been  identified. 

However: 

•  An  intuitive  interpretation  of  upper  and  lower 
probabilities  that  are  not  envelopes  of 
probabilities  is  still  missing  (especially  for 
undominated  lower  probabilities), 

•  The  methods  for  conditioning  upper  and  lower 
probabilities  are  still  unsatisfactory  (What  is  the 
right  way  to  do  it?) 
THE NUMERICAL SOLUTION OF A SYSTEM OF ORDINARY STOCHASTIC DIFFERENTIAL
EQUATIONS ON THE CYBER 205 SUPERCOMPUTER

Tim  Haas,  Colorado  State  University 


Summary 

A  numerical  method  using  an 
Ornstein-Uhlenbeck  approximation  to  the  Wiener 
Process  coupled  to  a  Runge-Kutta  algorithm  has 
been  programmed  on  the  Cyber  205  supercomputer 
to  solve  systems  of  ordinary  stochastic 
differential equations (OSDE's). Each equation
may  be  any  programmable  function  of  the 
independent  and  dependent  variables  optionally 
multiplied  by  a  stochastic  process  and/or  with 
an  additive  stochastic  process.  Equations  with 
analytical  solutions  are  solved  and  errors 
presented.  Cyber  205  timings  are  presented 
along  with  remarks  concerning  the  vectorization 
of  the  code. 

I.  Introduction 

Although the theory of stochastic
differential equations (SDE's) has received much
attention [1, 2], the numerical solution of
those equations having no known analytical
solution has not been subject to similar effort.
One reason may be the expensive computation that
solutions of these equations require.

Many  models  in  the  physical,  life  and  social 
sciences  may  be  beneficially  recast  in  an  SDE 
form. This form may, however, be analytically
intractable,  thus  a  numerical  approach  is  the 
only  recourse.  The  program  SDESS  (Stochastic 
Differential  Equation  System  Solver)  was 
developed  as  an  attempt  to  answer  this  need. 

Specifically, a program written by M. Elrod at
the  University  of  Georgia  on  a  CDC  7600  [3]  to 
solve  a  single,  particular  SDE  was  modified  to 
solve  a  general  system  and  adapted  to  the  Cyber 
205. 

II.  Equation  Form 

The  type  of  equation  system  solvable  by  SDESS 
is  of  the  form 

$$ \frac{dy_i}{dt} = \xi_i(t)\, F_i(\mathbf{y}, t) + \psi_i(t) $$

where,

$y_i$ is the i-th dependent variable

t is the independent variable

$\xi_i$, $\psi_i$ are stochastic processes which may be
nonstationary.

$F_i$ is any FORTRAN programmable function of
the vector of dependent variables and the
independent variable.

Since  a  fourth  order  Runge  Kutta  method  with 
no  variable  step  size  capability  is  used,  the 
SDE’s  should  not  be  "stiff"  although  this  may  be 
difficult  to  predetermine  without  knowing  the 
effect  of  the  stochastic  processes  on  the 
solution. 

III.  Input  for  SDESS 

Input for SDESS is of two types: source
code modification and input file creation.

The  first  type  consists  of  programming  the 
statement  functions  for  the  covariance  matrices 
for  the  two  stochastic  processes,  and  the 
statement  functions  for  the  equation  functions. 


This may seem unnecessarily cumbersome, but is
crucial  for  efficient  use  of  the  205  as 
described  below. 

The second type of input consists of the
input file which contains a) the number of
realizations of the solution desired, b) the number
of independent variable steps before the first
four moments of y are computed, c) the number of
covariances (number of values of the independent
variable equally spaced at which covariances are
to be computed), d) the sizes of the covariance
matrices for the stochastic processes.

SDESS is written as a subroutine so that the
input file is passed to SDESS through a "CALL"
statement.

IV.  Output  From  SDESS 

SDESS will output the following at the
requested time points, $t_1, t_2, \ldots, t_{final}$: a) the
mean value of $y_i$, b) the covariance matrix of
$y_i$ ($\mathrm{cov}(y_i(t_j), y_i(t_k))$), c) the mean value for each
of the stochastic processes, d) the
covariance matrix for each stochastic process
and e) the mean square error of the estimated
mean and covariance of each stochastic process.

V.  Verification  and  Test  Runs. 

Two  different  equations  have  been  run  using 
SDESS.  The  first  is  the  equation  (equation  A) 
used  by  Elrod  in  his  dissertation  to  verify  the 
program.  In  SDESS  a  two  equation  system  was 
created  by  coding  this  equation  twice.  The 
second  equation  (equation  B)  was  selected 
because  of  its  oscillating  and  analytical 
solution.  Here  also,  the  same  equation  was 
coded  twice  to  create  a  two  equation  system. 

The equation used by Elrod is:

$$ \frac{dy}{dt} = \xi(t)\, y + \psi(t) $$

where

$\langle \xi(t) \rangle = .5$
$\langle \psi(t) \rangle = .5 + \sin(2\pi t)$
y(0) is N(1, 1/3)

The covariance matrices of $\xi(t)$ and $\psi(t)$ are
given by

$$ \mathrm{cov}(t, t') = \exp(-|t - t'|) $$

Elrod  ran  the  above  using  10  realizations. 
During initial development of the code on a
VAX-11/750 with a floating point accelerator,
running in core, this solution was essentially
duplicated with a double precision version of
the code. The run time was about 160 hours at
about 95% cpu usage. In order to save
computation expense the two equation system of
the above was run on the 205 using only 10
realizations. The results were within expected
accuracy; see figs. 1 and 2 and Table 1 for
solution and error values from Elrod's, the VAX
and 205 runs.

The second equation was chosen to verify that
SDESS would not always allow a solution to "blow
up", a behavior suspected due to the potentially
imperfect simulation of a white noise process
inherent to the program. Also, an analytical
solution had to exist. Thus, for simplicity,

$$ \frac{dy}{dt} = -\sin(t) + \sigma \frac{dW(t)}{dt}, \qquad y(0) = 1, \qquad \sigma = 1, 4 $$

was chosen. The solution is given in Gihman and
Skorohod [1] as

$$ y = \cos t + \sigma \int_0^t dW(s), $$

where W is a Wiener Process, so that

$$ \langle y \rangle = \cos t $$

$$ \mathrm{Var}(y) = \sigma^2 t_{final} $$

Here,  the  initial  value  of  y  was  chosen  to  be 
nonrandom  to  simplify  the  variance  calculation. 

As can be seen in fig. 3, the numerical
solution of the mean value is quite close to the
analytical, this using a step size of .01 and
50,000 realizations.

In order to test the validity of the white
noise simulation, the equation was run first
with $\sigma = 1$ and then with $\sigma = 4$.

The Ornstein-Uhlenbeck process used in SDESS
converges to a white noise process (dW(t)) as
$a, b \to \infty$, $(a/b) \to 1/2$, where the covariance
function of the Ornstein-Uhlenbeck process is
given by:

(1) $\mathrm{cov}(t, t') = a\, e^{-b|t - t'|}$ ([6], pg. 55).
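A short check of why this limit behaves like white noise may help; it is a sketch added for illustration, writing $\tau = t - t'$. The total mass of the covariance function is fixed by the ratio a/b, while its width shrinks like 1/b:

```latex
\int_{-\infty}^{\infty} a\,e^{-b|\tau|}\,d\tau \;=\; \frac{2a}{b}
\;\longrightarrow\; 1
\quad\text{as } \frac{a}{b}\to\frac{1}{2},
\qquad
a\,e^{-b|\tau|} \;\longrightarrow\; 0
\quad\text{for fixed } \tau \neq 0 \text{ as } b\to\infty \ (a = b/2).
```

The covariance therefore concentrates unit mass at $\tau = 0$, which is the delta-function covariance of the formal derivative dW/dt.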

Specifically, SDESS uses (1) to calculate
adjustment vectors which modify the Gaussian
random variables generated each realization. To
avoid underflow on the computer, b was fixed at
1000 and a was calibrated using the $\sigma = 1$
equation. Since a linear relationship was found
to exist between a and the Var(Y) at $t_{final}$,
interpolation was arbitrarily stopped when the
calculated Var(Y) was close to the analytical.

To check the validity of the simulation, $\sigma$ was
then set to 4 and run; the Var(Y) calculation
was again quite close (see Table 2). Ideally,
analytically determined values of a and b could
be given which would yield accurate results
(such as $a = 10^5$, $b = 2 \times 10^5$). However, it is
suspected the time step would have to be quite
small to make use of this finer approximation;
the run times then would probably be
impractical. The calibration method, although
not a general result, does appear to allow
accurate simulation of stochastic processes.
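The white-noise test can be imitated on a small scale. The sketch below is not SDESS: it is a pure-Python illustration that swaps the fourth order Runge-Kutta scheme for a plain Euler step, generates the Ornstein-Uhlenbeck process by its exact one-step update, and uses assumed parameters a = 25, b = 50 (so that a/b = 1/2) with 2000 realizations rather than the calibrated values described above. The function name `simulate` and all defaults are hypothetical.

```python
import math
import random


def simulate(sigma, a=25.0, b=50.0, dt=0.01, t_final=1.0,
             n_real=2000, seed=1):
    """Monte Carlo solution of dy/dt = -sin(t) + sigma * eta(t), y(0) = 1,
    where eta is an Ornstein-Uhlenbeck process with covariance
    a * exp(-b |t - t'|). Returns (mean, variance) of y at t_final."""
    rng = random.Random(seed)
    n_steps = round(t_final / dt)
    rho = math.exp(-b * dt)                    # one-step autocorrelation
    innov_sd = math.sqrt(a * (1.0 - rho * rho))
    finals = []
    for _ in range(n_real):
        eta = rng.gauss(0.0, math.sqrt(a))     # stationary start
        y, t = 1.0, 0.0
        for _ in range(n_steps):
            y += dt * (-math.sin(t) + sigma * eta)          # Euler step
            eta = rho * eta + innov_sd * rng.gauss(0.0, 1.0)  # exact OU update
            t += dt
        finals.append(y)
    mean = sum(finals) / n_real
    var = sum((v - mean) ** 2 for v in finals) / (n_real - 1)
    return mean, var


if __name__ == "__main__":
    for sigma in (1.0, 4.0):
        mean, var = simulate(sigma)
        print(f"sigma={sigma}: <y>={mean:.3f} (cos t = {math.cos(1.0):.3f}), "
              f"Var(y)={var:.3f} (sigma^2 t = {sigma * sigma:.1f})")
```

With these assumed settings the sample mean should sit near cos(1) and the sample variance near $\sigma^2 t_{final}$, up to Monte Carlo and discretization error.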

The  205  cpu  timings  for  each  equation  system 
are  given  in  Table  3. 

VI.  Remarks  on  Vectorization  of  the  Code. 

The Cyber 205, although a fast sequential
machine, achieves most of its speed via
vectorization of code segments (usually "DO"
loops) [4]. Thus, to take advantage of this
capability the DO loops in the source code must
have certain characteristics which will allow
the computer to vectorize. Unfortunately, the
Runge-Kutta algorithm, wherein SDESS spends about
90% of its run time, is by nature a recursive
process which is not vectorizable.

Surprisingly, the seemingly time consuming
collection of "IF" statements required to
compute a value of each stochastic process each
time step does not require a relatively large
amount of time. Thus, in order to increase the
speed as much as possible, all subroutine and
function calls within the Runge-Kutta "DO" loops
were replaced by in-line code. Also, the random
number generation using the function "URAND" [5]
in early code development was replaced by calls
to the 205 random number generator, "RANF" [4].
These two modifications resulted in the modified
SDESS running about 3 times faster than the
unmodified code on the 205. However, SDESS on
the 205 is only about 30 times faster than the
VAX-750 implementation, a low efficiency for a
vector machine.

If  a  parametric  study  of  a  particular  system 
of  SDE’s  were  desired  (as  in  nonlinear 
regression),  it  should  be  mentioned  that  saving 
the  values  of  the  stochastic  processes  each 
realization  and  using  these  on  subsequent 
iterations  would  result  in  a  speed  increase. 
However, judging from experience with "turning
off" one stochastic process, the speed increase
will probably be modest.

VII.  Conclusions 

A solution of a general system of OSDE's
appears to be feasible using somewhere in the
neighborhood of $10^5$ realizations. However, even
using a supercomputer, the computation is still
quite expensive. In general, run-times would be
decreased if new results could produce
algorithms which converge with fewer
realizations and, particular to vector processing
computers, if a nonrecursive numerical method
could be used to solve the SDE.

Incorporating  this  solver  into  a  nonlinear 
regression  routine  using  absolute  differences  is 
planned. 
Table 1. Percent error from exact solution at t_final (equation A).

Table 3. 205 central processing unit time and cost for each solution.

Equation                            Time (seconds)    Cost
1 (Elrod's), 10 realizations        1884.77           $1889.
2 (-sin t), 50,000 realizations     711.30            $715.

* Based on running equation B with $10^5$ realizations 3 separate times, a rough standard deviation
for the <Y> at t_final was found to be .0107. A standard deviation for the Var Y at t_final was
likewise found to be .0136. These will decrease in inverse proportion to the square root of the number
of realizations. Similar values would be found for Equation A.

FIGURE 1: <Y> OF EQUATION A AT T = 2.5

SOLVING  NONLINEAR  ECONOMETRIC  MODELS  USING  VECTOR  PROCESSORS 

Patrick J. Hénaff, Massachusetts Institute of Technology
Alfred  L.  Norman,  University  of  Texas 


1.  INTRODUCTION 

This  paper  reports  on  the  design  and 
implementation  of  a  reduced  Newton  algorithm  for 
solving  large  non-linear  econometric  models  on  a 
vector  processor  such  as  the  CYBER  205  or  a  CRAY 
X/MP  machine.  To  take  full  advantage  of  the  vector 
processing  capabilities,  one  needs  to  organize  the 
computations such that operations are carried out on
vectors. For the price of a small start-up time,
which  varies  with  machines,  a  vector  processor 
performs  a  floating  point  operation  on  a  component 
of  a  vector  in  a  fraction  of  the  time  needed  to 
perform  the  same  operation  on  an  isolated  scalar. 
Up  to  now,  numerical  methods  have  been  optimized 
for  scalar  processors;  this  involved: 

•  minimizing  the  storage  requirement.  For 
Newton’s  algorithm,  this  was  achieved  by 
developing  software  for  operating  on  matrices 
stored  in  sparse  format. 

•  minimizing  the  number  of  floating  point 
operations  by,  for  example,  computing  a  row  and 
column  permutation  that  reduces  the  amount  of 
fill-in  during  matrix  factorization. 

Vector  processors,  however,  suggest  a  different  set 
of  criteria  for  optimizing  the  implementation  of 
numerical  methods: 

•  Because  of  dramatic  reduction  in  cost  and 
progress  in  miniaturization  of  RAM  components, 
the  memory  available  on  recent  computers, 
especially  on  the  vector  processor,  is  very  large. 
As  a  result,  core  requirement  minimization  is 
not  as  strong  an  imperative  as  it  used  to  be. 

• Computation should be organized as vector
operations  as  much  as  possible,  even  if  this  leads 
to  some  redundant  computation. 

Let  us  now  turn  to  econometric  models.  To  solve 
such  models  by  Newton’s  method  on  a  vector 
processor,  one  approach  has  been  to  rewrite  sparse 
matrix  code  to  take  advantage  of  vector  processing. 
However,  that  approach  may  be  limited  by  the  fact 
that  sparse  matrix  techniques  were  designed 
according to criteria not entirely relevant to vector
processing.  In  contrast,  our  approach  is  to 
restructure  the  problem  at  hand  so  that 
computations  can  be  naturally  expressed  as  vector 
operations,  and  sparse  matrix  storage  schemes 
avoided  altogether.  The  rest  of  the  paper  is 
organized  as  follows:  Section  2  presents  the  solution 
algorithm  and  how  it  leads  to  vector  processing. 
Several  technical  features  of  the  method  are 
discussed  in  Section  3,  and  the  coding  is  considered 
in  Section  4.  Timing  of  vector  versus  scalar 
processing  solutions  for  a  medium  scale  econometric 
model  follows. 


2.  THE  SOLUTION  ALGORITHM  AND  HOW 
IT  CAN  BE  VECTORIZED 

2.1  The  solution  method 

We  consider  the  set  of  simultaneous  equations 


$$ f(z) = 0 \qquad [1] $$

where f is an n-component function and z a vector
of n endogenous variables. Predetermined
variables are omitted for clarity. At a given point
$z_k$, the residual of the system is $f(z_k) = d_k$. The
system's Jacobian is $J_k = [\partial f / \partial z]$.

A reduced system equivalent to [1] is derived by
using some equations to eliminate endogenous
variables, thus reducing the dimension of the
system. First, a set of loop variables is identified in
the system. A set of loop variables is such that if the
variables in this set were predetermined, then the
other variables in the system could be computed
recursively. After an appropriate permutation of
variables and equations, the system of equations is
partitioned into 2 blocks, called the core and the
loop block:

$$ \begin{aligned} g(x, y) &= 0 \\ h(x, y) &= 0 \end{aligned} \qquad [2] $$
where g and h are vector-valued functions with
respectively (n-s) and s components, y a vector of s
loop variables and x a vector of n-s core variables.
Finally, at a point $(x_k, y_k)$, let the equation
residuals be similarly partitioned $(b_k, c_k)$. For each
equation in the core we define the function $f'_i$,
which is the original function $f_i$, solved for variable
$x_i$:

$$ x_i = f'_i(x_1, \ldots, x_{i-1}, y), \qquad i = 1, \ldots, n-s \qquad [3] $$

The error function is defined as:

$$ \phi(y) = h(x(y), y) $$

with x(y) defined by the core equations [3]. The
original system is equivalent to:

$$ \phi(y) = 0 \qquad [4] $$

which is called the reduced problem. Its Jacobian is
$T^* = [\partial \phi / \partial y]$. Newton's algorithm applied to
problem [1] will be referred to as the Global
Newton's algorithm (GN), while the same method
applied to problem [4] will be the Reduced Newton's
algorithm (RN). One iteration of RN involves 4
steps:

1. Evaluate the error function at the current point,
$y_k$:

$$ \phi(y_k) = d_k \qquad [5] $$

2. Compute $T^*_k$ at $y_k$.

3. Solve Newton's equation for $p_k$:

$$ T^*_k\, p_k = -d_k \qquad [6] $$

4. Convergence test: Stop or go to step 1 with
$y_{k+1} = y_k + p_k$.

Corresponding to the partitioning of variables, the
Jacobian of the original system can be similarly
partitioned as:

$$ J = \begin{bmatrix} G & R \\ S & T \end{bmatrix} \qquad [7] $$

The Jacobian of the reduced system is then:

$$ T^* = T - S\, G^{-1} R \qquad [8] $$

A finite difference approximation to T* is easily
computed because of the recursive structure of the
core equations. The jth column of T* is obtained by:

$$ \frac{\partial \phi}{\partial y_j} \approx \frac{1}{h_j} \left[ \phi(y + h_j e_j) - \phi(y) \right] \qquad [9] $$

where $e_j$ is the jth unit vector and $h_j$ a small scalar.
The determination of $h_j$ is discussed in the next
section. The Jacobian of the reduced problem is
usually dense and of small size, so that the solution
of Newton's equation for the reduced problem can be
easily obtained. This solution method has been
described in (1).

2.2 How it can be vectorized

Vectorization is achieved at two steps of the
algorithm. First, in evaluating the Jacobian of the
reduced model by [9] and second in the solution of
Newton's equation [6] for the reduced model. The
vectorization of the second step is by now a standard
procedure, since the Jacobian of the reduced system
is dense and stored in full format. Library
subroutines are available for carrying out such
computation (see, for example, (2)). The
vectorization of the first step is made possible by
computing in sequence the perturbed values of each
core and loop variable corresponding to the
perturbations of each loop variable. Each core and
loop equation is now a vector expression where each
endogenous variable is a vector of length s. At the
beginning of the computation, the vectors of loop
variables are initialized to the base values of these
loop variables. Then, for each loop variable j, the jth
element of the corresponding vector is perturbed by
$h_j$. The core equations are then evaluated
recursively, then the loop equations. The result is an
s×s matrix of error terms, where entry (i, j)
represents the error on loop equation i
corresponding to the perturbation of the jth loop
variable. The Jacobian of the reduced system is
immediately obtained by dividing each column by
the corresponding $h_j$.

3. TECHNICAL ISSUES

The method raises several technical issues
which are now addressed. Let us first consider the
relationship of the solution algorithm to Newton's
method applied to the original problem.

The original and reduced system are equivalent
in the sense that a solution to the reduced system is
also a solution to the original system. Moreover, the
following fact can be established:

Fact: Let the notation be as in section 2. If the
functions g and h are linear with respect to x, then
RN and GN, started from a common value $y_0$, will
generate the same sequence $\{y_k\}$ for the loop
variables.

Proof: Dropping the k subscript for clarity, let the
current point be (x, y) and the corresponding
residuals be (b, c). The Global Newton's step for the
loop variables is

$$ t = T^{*-1} \left( -c + S\, G^{-1} b \right) \qquad [10] $$

Let us consider RN applied to the same point. First,
x is evaluated so that [3] holds. Since g is linear
with respect to x, the solution is:

$$ x^+ = x - G^{-1} b $$

The new residuals of the loop equations are then:

$$ h(x^+, y) = c - S\, G^{-1} b $$

Since S and G are constant, T* evaluated at $(x^+, y)$
is the same as T* evaluated at (x, y). The Reduced
Newton step for the loop variables is then:

$$ t^+ = -T^{*-1} h(x^+, y) = -T^{*-1} \left( c - S\, G^{-1} b \right) \qquad [11] $$

which is the same as the GN step in [10].
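The equality of the two steps is easy to confirm on a toy problem. The following Python sketch assumes one core variable and one loop variable, so that G, R, S and T are scalars; the coefficient values and function names are invented for illustration only.

```python
# Linear toy model (assumed coefficients):
#   g(x, y) = G*x + R*y - u   (core equation)
#   h(x, y) = S*x + T*y - v   (loop equation)
G, R, S, T = 2.0, 1.0, 0.5, 3.0
u, v = 1.0, 2.0


def residuals(x, y):
    """Residuals (b, c) of the core and loop equations at (x, y)."""
    return G * x + R * y - u, S * x + T * y - v


def global_newton_step(x, y):
    """Loop-variable component of the Newton step on the full 2x2 system,
    solved by Cramer's rule: [[G, R], [S, T]] [dx, dy] = [-b, -c]."""
    b, c = residuals(x, y)
    det = G * T - R * S
    return (S * b - G * c) / det


def reduced_newton_step(x, y):
    """First solve the core equation for x at fixed y, then take a Newton
    step on the reduced problem with T* = T - S * R / G."""
    b, _ = residuals(x, y)
    x_plus = x - b / G                  # core equation now holds exactly
    _, c_plus = residuals(x_plus, y)    # new loop residual
    t_star = T - S * R / G
    return -c_plus / t_star


if __name__ == "__main__":
    x0, y0 = 0.3, -0.4
    print("GN step:", global_newton_step(x0, y0))
    print("RN step:", reduced_newton_step(x0, y0))
```

Both functions return the same increment for the loop variable, whatever starting point is chosen, as long as G and the determinant are nonzero.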

In the general nonlinear case, however, the
paths to the same solution will be different. In the
case where multiple solutions to the original
problem exist, it is even conceivable that RN and
GN, started from the same point, would yield
different solutions.

Two convergence tests are applied at each
iteration. Iterations terminate if the maximum
relative error on the loop equations is less than $\epsilon_1$:


$$ \max_i \left| \frac{\phi_i(y_k)}{y_{i,k}} \right| \le \epsilon_1 \qquad [12] $$


Iterations also terminate if the maximum relative
change in the variables from one iteration to the
next is less than $\epsilon_2$:


$$ \max_i \left| \frac{y_{i,k+1} - y_{i,k}}{y_{i,k}} \right| \le \epsilon_2 \qquad [13] $$


Note that the core equations are by construction
exactly verified at each iteration, so that criterion
[12] implies that the maximum relative error over
all equations is less than $\epsilon_1$. Well conditioned
problems should terminate with criterion [12].
Criterion [13] is tested after [12] to terminate
iterations when iterates stabilize at a point away
from a local solution.

The selection of the perturbation scalar $h_j$ is now
considered. The problem is to choose for each
variable $y_j$ a perturbation $h_j$ such that it minimizes
truncation and cancellation errors on the
evaluation of g and h.

At  this  point,  a  trial-and-error  approach  has 
been  used,  with  the  selected  hj’s  corresponding  to 
the  perturbations  yielding  the  best  convergence 
properties.  A  more  systematic  approach  is  the 
object  of  current  research. 

4.  CODING 

The coding is best explained through an
example, for which pseudo-code will be shown.
Consider the system arranged in quasi-triangular
form:

   z1  = f1(z11, ..., z15)
   z2  = f2(z1, z11, ..., z15)
   z3  = f3(z1, z2, z11, ..., z15)
   ...
   z10 = f10(z1, ..., z9, z11, ..., z15)
   z11 = f11(z1, ..., z15)
   ...
   z15 = f15(z1, ..., z15)

The core variables are z1 through z10, and the loop
variables z11 to z15. Let zk be the current point. It
is stored in row 6 of array Z. The error at the
current point is stored in row 6 of array PHI. The
reduced Jacobian at zk is evaluated as follows.

1. Initialize the array of loop variables to current
   point value

      for i = 1 to 5
         for j = 11 to 15
            Z(i,j) = Z(6,j)

2. Set perturbed values of loop variables

      for i = 1 to 5
         Z(i,i+10) = Z(i,i+10) + H(i)

3. Execute subroutine REDUC. A 5x5 array PHI
   is returned. Each row is the perturbed value of
   the error vector for the corresponding perturbed
   loop variable.

4. Compute the transposed reduced Jacobian by
   dividing each row i of PHI by the corresponding
   perturbation H(i)

      for i = 1 to 5
         for j = 1 to 5
            PHI(i,j) = (PHI(i,j) - PHI(6,j))/H(i)

The relevant portion of subroutine REDUC is as
follows:

      nloop = 5

      for i = 1 to nloop
         Z(i,1) = F1(Z(i,11), ..., Z(i,15))
      for i = 1 to nloop
         Z(i,2) = F2(Z(i,1), Z(i,11), ..., Z(i,15))
      ...
      for i = 1 to nloop
         Z(i,10) = F10(Z(i,1), ..., Z(i,9), Z(i,11), ..., Z(i,15))
      for i = 1 to nloop
         PHI(i,1) = F11(Z(i,1), ..., Z(i,15))
      ...
      for i = 1 to nloop
         PHI(i,5) = F15(Z(i,1), ..., Z(i,15))
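The same layout can be sketched in Python for a much smaller, invented model. Everything here is an assumption for illustration: the toy equations (two core variables x1, x2 and two loop variables y1, y2), the function names, and the use of Cramer's rule for the 2x2 Newton equation in place of a library solver. As in the pseudo-code, the perturbed points occupy the first rows of Z and the base point the last row, and each loop over rows inside `reduc` is the part that a vector processor would execute as one vector operation.

```python
import math


def reduc(Z):
    """Evaluate the core equations recursively, then the loop residuals,
    for every row of Z. Row layout: [x1, x2, y1, y2]."""
    n = len(Z)
    for i in range(n):                       # core equation 1
        Z[i][0] = 0.5 * math.cos(Z[i][2]) + 0.1 * Z[i][3]
    for i in range(n):                       # core equation 2
        Z[i][1] = 0.3 * Z[i][0] + 0.2 * Z[i][2]
    phi = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):                       # loop residual 1
        phi[i][0] = Z[i][2] - (0.4 * Z[i][1] + 0.5)
    for i in range(n):                       # loop residual 2
        phi[i][1] = Z[i][3] - (0.3 * Z[i][0] * Z[i][2] + 0.2)
    return phi


def reduced_newton(y, h=1e-6, tol=1e-10, max_iter=20):
    """Reduced Newton iteration on the two loop variables."""
    s = len(y)
    for iteration in range(max_iter):
        # Rows 0..s-1 hold the perturbed points, row s the base point.
        Z = [[0.0, 0.0] + list(y) for _ in range(s + 1)]
        for i in range(s):
            Z[i][2 + i] += h
        phi = reduc(Z)
        base = phi[s]
        if max(abs(r) for r in base) < tol:
            return y, iteration
        # Transposed reduced Jacobian: row i holds d(phi)/d(y_i).
        Jt = [[(phi[i][j] - base[j]) / h for j in range(s)] for i in range(s)]
        a11, a12 = Jt[0][0], Jt[1][0]
        a21, a22 = Jt[0][1], Jt[1][1]
        det = a11 * a22 - a12 * a21
        p0 = (-base[0] * a22 + a12 * base[1]) / det
        p1 = (-a11 * base[1] + a21 * base[0]) / det
        y = [y[0] + p0, y[1] + p1]
    raise RuntimeError("reduced Newton did not converge")


if __name__ == "__main__":
    solution, n_iter = reduced_newton([0.0, 0.0])
    print("loop variables:", solution, "iterations:", n_iter)
```

One call to `reduc` yields the base residual and all perturbed residuals together, so the cost of the Jacobian is one sweep through the equations on vectors of length s + 1 rather than s + 1 scalar sweeps.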


5.  NUMERICAL  RESULTS 

The  authors  selected  for  experimentation  the 
Texas  Econometric  Model  (version  M5)  developed 
by  the  Bureau  of  Business  Research  of  The 
University  of  Texas  in  Austin.  With  293  equations, 
this  model  is  characteristic  of  intermediate  size 
econometric  models.  It  can  be  partitioned  into  a  59 
equation  recursive  prologue,  a  201  equation 
simultaneous  block,  and  a  33  equation  recursive 
epilogue.  The  three  algorithms  considered  were  the 
reduced  Newton  (RN),  the  modified  reduced  Newton 
(mRN),  and  the  Gauss  Seidel  (GS).  The  mRN 
algorithm  was  obtained  by  using  the  LU 
decomposition  of  T*  computed  at  the  first  iteration 
in  all  subsequent  iterations.  The  GS  algorithm 
employed  the  ordering  and  normalization  of  the  RN 
algorithm with each loop equation normalized on a
loop variable.

The  algorithms  were  coded  in  FORTRAN  77  and 
compiled  with  the  FTN200  compiler  on  the  CYBER 
205  at  Purdue  University.  For  the  scalar  runs  a 
scalar  version  of  LEQFIT  subroutine  from  the  IMSL 
subroutine  library  was  employed  to  solve  the  linear 
system.  For  the  vectorized  runs  the  GEL  subroutine 
from  the  MAGEV  library  (2)  was  employed.  The 
results for solving the model for one period (1970),
with a convergence criterion $\epsilon_2$ = .1e-3, are
displayed in the following table:


Table I

Number of iterations and time (sec) to solve
model M5 for 1970

Algorithm    No Vectorization    Vectorization

GS           19 (.0365)          19 (.0318)
RN            4 (.1454)           4 (.0546)
mRN           4 (.0586)           4 (.0200)


For  the  GS  code  the  recursive  nature  of  the 
algorithm  prohibits  any  significant  vectorization. 
The  only  operation  which  can  be  vectorized  is  the 
storing  of  the  current  point  at  a  given  iteration  for 
comparison  with  the  result  of  the  next  iteration.  In 
the  RN  algorithm,  most  of  the  steps  can  be 
vectorized. In the mRN code, the GEL subroutine
solves the linear system [6] in less than 10% of the
time required to obtain a single scalar solution to
the  core  and  loop  equations.  This  means  that  in  the 
mRN  algorithm  the  second  and  subsequent 
iterations  are  obtained  at  a  cost  only  slightly 
greater  than  the  cost  of  a  Gauss  Seidel  iteration. 
With  vectorization,  mRN  achieves  a  saving  of  about 
1/3  over  GS.  On  the  CYBER  205  setting  up  the 
vector  pipeline  has  a  substantial  overhead  which 
can  be  seen  in  Table  2.  The  table  displays  the  time 
needed  to  execute  the  subroutine  REDUC  for 
various  values  of  parameter  nloop  (see  Section  4). 


Table 2

Time to solve the core and loop equations

nloop          1       6       11      16      21

scalar code    .0018   .0078   .0142   .0204   .0268
vector code    .0053   .0062   .0079   .0097   .0103

Given  the  overhead  in  setting  up  the  pipeline, 
the  time  to  solve  the  model  using  the  scalar 
processor  is  less  than  the  time  for  the  vector 
processor  for  nloop  less  than  8.  Thus,  when  using 
the  CYBER  205,  trying  to  minimize  the  number  of 
loop  variables  is  not  too  relevant,  at  least  when 
intermediate  size  models  are  considered.  On  a 
CRAY  machine,  however,  some  increase  in  speed 
occurs  even  with  a  vector  of  length  two.  Hence 
where  the  number  of  loop  variables  is  small  one 
would  expect  the  RN  and  mRN  algorithms  to  be 
much  more  effective  on  CRAY  machines. 


6.  CONCLUSION 

A Newton-type algorithm adapted to vector
processing has been described. Preliminary
numerical results are encouraging: An m-step
Newton's method was found to be 33% faster than
the Gauss-Seidel algorithm in solving an
econometric model of intermediate size.

Further research is planned along two axes. The
first  one  is  to  perform  more  experiments,  this  time 
using  a  much  larger  multicountry  model  with  a  loop 
of  over  100  variables.  It  is  also  intended  to  carry  the 
same  computations  on  a  CRAY  X/MP-24,  soon  to  be 
installed  at  the  University  of  Texas  in  Austin.  The 
other  axis  is  to  develop  a  systematic  approach  to 
some  aspects  of  the  algorithm,  in  particular  to  the 
choice  of  the  perturbation  values  used  in  the 
computation  of  the  reduced  Jacobian  by  finite 
differencing. 

7.  BIBLIOGRAPHY 

(1) Nepomiastchy et al. "Adapted Methods for Solving
and Optimizing Quasi-Triangular Econometric
Models," Annals of Economic and Social
Measurement, Vol. 12, 1978.

(2) "The Math-Geophysical Vector Library", document
22-MAGEV, Purdue University Computing Center.


B-SPLINE  ESTIMATION  OF  THE  HAZARD  FUNCTION  IN  PERIOD  ANALYSIS 


John  J.  Hsieh,  University  of  Toronto 


This article develops a precise method for estimating the hazard
function, survival function and density function for period analysis
using B-splines. Explicit expressions for the representation of the
hazard function as a linear combination of quadratic B-splines are
obtained. Accuracy is achieved by using three overlapping consecutive
age-specific death rates covering three years each, as coefficients of
the B-spline basis defined on a single-year uniformly-spaced knot
sequence. The exact expressions of various functions describing the
probability distribution of the lifetime are derived from the hazard
function. The methods are illustrated using 1981 Canadian male population
and death data.


1.  INTRODUCTION 

The  intent  of  this  article  is  to  employ  the 
B-spline  basis  to  estimate  the  hazard  function 
(force  of  mortality)  as  well  as  its  associated 
survival  and  density  functions  using  death  and 
population  data  from  period  analysis. 

Data from government publications available
for the study of the distribution of the human
life length come from two sources: those
classified by age-of-death come from vital
registration and yield counts of occurrences
of deaths grouped by single-year age intervals,
while those pertaining to age-still-alive come
from the census and give rise to counts of population
size grouped in five-year age intervals.

A  well-known  method  of  estimation  as  well  as 
approximation  used  in  mortality  analysis  is  to 
compute  the  age-specific  death  rate  as  the 
ratio  of  the  number  of  deaths  to  the  number  of 
person-years  of  exposure  in  a  given  age  interval 
and  to  estimate  the  hazard  function  at  every 
age  within  the  age  interval  by  the  death  rate  for 
that  interval.  The  difficulty  with  this  procedure 
is  that  the  hazard  function  so  estimated  is 
constant  for  every  age  within  an  age  interval 
and  jumps  at  every  division  point  between  two 
age  intervals,  so  that  the  hazard  function  is 
estimated  by  a  step  function  with  step  width 
normally  five  years  long. 

In this article we shall improve this
time-honoured procedure by smoothing out the step
function estimate using spline functions. This
is accomplished by combining the death rate in
the age interval within which the hazard function
is sought with the death rates from the left
and the right adjacent intervals and by
redistributing them in a quadratic fashion among
these consecutive overlapping age intervals.

The  technique  with  which  this  is  effected  is 
the  representation  of  the  hazard  function 
as  a  linear  combination  of  the  quadratic  B-splines 
on  single-year  uniformly-spaced  knot  sequences 
with  the  overlapping  three-year  age-specific 
death  rates  as  the  coefficients.  To  compute  the 
death  rates,  the  populations  in  five-year  age 
groupings  are  cumulated  and  then  interpolated 
into  single  years  using  a  complete  cubic  spline. 
Once the hazard function is obtained, the
survival function and the death density function
are evaluated directly from the hazard function
by exact integration.

The  spline  functions  employed  in  this  paper  both 
for  interpolation  and  for  estimation  enjoy  minimum 
norm,  best  approximation  and  fast  convergence 
properties.  The  set  of  the  B-spline  basis  is  a 
generalization  of  the  "hat"  functions  and  is  a 
well-conditioned  basis  for  spanning  spline 
functions.  As  a  Peano  Kernel,  it  provides  a  local 
partition  of  unity  on  the  entire  agespan  with 
small  supports. 

Since  the  behavior  of  the  hazard  function  (in 
particular,  its  speed  of  decline)  during  the  first 
year  of  life  differs  from  that  for  the  remainder 
of  the  lifespan  and  since  the  infant  population 
tends  to  be  underenumerated,  the  method  and  data 
for  estimating  the  hazard  function  for  ages  under 
one  should  differ  from  the  methods  of  estimation 
for  the  remaining  life.  An  infant  mortality  law 
and  a  method  for  estimating  the  hazard  and  other 
related  functions  for  the  first  year  of  life  have 
recently been provided by the author (see Hsieh,
1985). We shall make use of the results from that
work and in this paper concentrate on the agespan
$[t_1, t_n]$, where $t_1 = 1$ year and $t_n$ may be taken as
85 or 90 years or whatever advanced age depending
on the availability and reliability of data at
these ages.

In  Section  2,  explicit  expressions  are  derived 
for  the  quadratic  B-splines  as  well  as  their 
derivatives  and  integrals.  Several  useful  simple 
properties  of  B-splines  are  also  discussed. 

Section  3  describes  the  method  of  estimation  of 
the  hazard  function  using  B-splines.  Explicit 
formulas  are  given  for  estimates  of  the  hazard 
function,  survival  function  and  the  death  density 
function  based  on  single-year  uniformly-spaced 
knot  sequences.  Section  4  provides  an  example 
of  estimation  using  1981  Canadian  male  population 
and death data. A comparison is made with two
other existing methods of estimation for period
analysis.

2.  B-SPLINES  AND  THEIR  PROPERTIES 

There  are  several  ways  of  defining  a  B-spline 
(B  stands  for  basis).  We  shall  use  a  definition 
consistent with current usage and appropriate
for our application. (For a survey and list of
references about B-splines see de Boor, 1976.)

For a chosen sequence of knots
$t_{-r+2} \le \dots \le t_1 < t_2 < \dots < t_n \le \dots \le t_{n+r-1}$
$(r = 1, 2, \dots)$, the i-th B-spline of order r, denoted by
$B_{ir}$ or simply by $B_i$ with the order r understood, is a
piecewise polynomial of degree r-1 defined as the product of
$(t_{i+r} - t_i)$ and the r-th divided difference of the truncated
power function $(t-x)_+^{r-1} = [\max(0, t-x)]^{r-1}$ with
respect to t at the knots $t_i, t_{i+1}, \dots, t_{i+r}$
$(i = -r+2, \dots, n-1)$. In symbols,

$$B_{ir}(x) = (t_{i+r} - t_i)\, f[t_i, t_{i+1}, \dots, t_{i+r}; x], \qquad (1)$$

where $f[t_i, \dots, t_{i+r}; x]$ is the r-th divided
difference of $f(t;x) = (t-x)_+^{r-1}$ at the points
$t_i, t_{i+1}, \dots, t_{i+r}$. In other words,
$f[t_i, \dots, t_{i+r}; x]$ is the leading coefficient of the
polynomial of degree r which agrees with the function
$f(t;x) = (t-x)_+^{r-1}$ at the points $t_i, t_{i+1}, \dots, t_{i+r}$.

Since published data on populations and deaths
are given in single-year or five-year age groupings,
to construct a B-spline basis for the
purpose of estimating the hazard function, we
shall partition the agespan $[t_1, t_n]$ by choosing
the exact integral ages as the division points
of the age axis and place one knot $t_i$ on each
of the interior division points, $i = 2, 3, \dots, n-1$,
and r knots on each of the two boundary points,
so that $t_1 = t_0 = \dots = t_{-r+2}$ are the initial knots
and $t_n = t_{n+1} = \dots = t_{n+r-1}$ are the final knots.


Both theoretical and empirical considerations
indicate that choice of r=3 or 4 will produce
optimal results. For estimating the hazard
function, quadratic B-splines (r=3) are preferable
to cubic B-splines (r=4). The use of the
former will avoid possible undue undulations
caused by the use of cubic or higher-degree
splines and at the same time retain the smoothness
of a spline function, with the added benefit of
simpler expressions for the hazard function as
well as the survival function and the death
density function derived therefrom.

Using the definition of $B_{ir}(x)$ given above and
the recursive relation of the divided difference

$$f[t_i, \dots, t_{i+r}; x] = \frac{f[t_{i+1}, \dots, t_{i+r}; x] - f[t_i, \dots, t_{i+r-1}; x]}{t_{i+r} - t_i} \qquad (2)$$

(when multiple knots occur, the derivatives with respect to t
naturally enter (2), so that

$$f[t,t;x] = f'(t;x), \quad f[t,t,s;x] = \frac{f'(t;x) - [f(t;x) - f(s;x)]/(t-s)}{t-s}, \quad f[t,t,t;x] = f''(t;x)/2!, \ \text{etc.}), \qquad (3)$$

explicit formulas for B-splines of order r=3 are
derived and given in Column 2 of Table 1. From
the expressions in Column 2, the formulas for the
first order derivatives and integrals are derived
and shown in Columns 3 and 4, respectively (where
we have used the notation $\Delta_i = t_{i+1} - t_i$ and
$\psi_{i,k}(x) = \int_{t_{i+k}}^{x} B_i(y)\,dy$ for
$x \in (t_{i+k}, t_{i+k+1}]$, k = 0, 1, 2). Notice
that, for $i = 1, 2, \dots, n-3$, $B_i(x)$ is a three-piece
quadratic with support $(t_i, t_{i+3})$ and is continuously
differentiable at each of its four knots
$t_i, t_{i+1}, t_{i+2}$ and $t_{i+3}$. For i=0, $B_i(x)$ is a
two-piece quadratic with support $(t_{i+1}, t_{i+3})$ and is
continuously differentiable only at the two knots $t_{i+2}$
and $t_{i+3}$ but not at the boundary point x=1 because
two knots $t_i$ and $t_{i+1}$ are placed there. For i=-1,
$B_i(x)$ consists of only one quadratic function
with support $[t_{i+2}, t_{i+3})$ and is neither continuous
nor continuously differentiable at x=1 because
three knots are placed there. Similarly, the
same may be said of the case with i=n-2 and n-1
on the opposite end of the agespan in a symmetrical
fashion. The expression for the value
of the $B_i(x)$ and its derivative at each knot is
also shown in Table 1.

B-splines have many desirable properties; those
of which we will be making use are listed
below:



(i) For each i, $B_i(x)$ is a spline function of
degree r-1 on the real line if no multiple knots
are involved and hence $\Phi_i(x) = \int_{-\infty}^{x} B_i(y)\,dy$
is a monotone increasing spline function of order
r+1 with $\Phi_i(x) = 0$ for $x \le t_i$.

(ii) The area under the curve is given by

$$\int_{-\infty}^{\infty} B_i(x)\,dx = (t_{i+r} - t_i)/r. \qquad (4)$$

Thus, for single-year uniformly-spaced knot
sequences, $t_i = i$, all i, this integral becomes
unity so that $B_i(x)$ represents a probability
density on the real line.

(iii) The support of the $B_i(x)$ functions is
restricted to $(t_i, t_{i+r})$, i.e.,

$$B_i(x) \begin{cases} > 0 & \text{for } x \in (t_i, t_{i+r}), \\ = 0 & \text{for } x \notin (t_i, t_{i+r}). \end{cases} \qquad (5)$$

(iv) The sum of the $B_i$'s at a given x is unity,
i.e., for $x \in (t_j, t_{j+1}]$, $j = 1, \dots, n-1$,

$$\sum_i B_i(x) = \sum_{i=j-r+1}^{j} B_i(x) = 1, \qquad (6)$$

and hence for a single-year knot sequence,

$$\sum_{i=j-r+1}^{j} \int_j^{j+1} B_i(x)\,dx = 1, \qquad (7)$$

so that $\sum_{i=j-r+1}^{j} B_i(x)$ represents a probability
density on [j, j+1].

(v) At a given division point on the age axis,
the number of continuity conditions plus the
number of knots equals the order of the B-splines.

(vi) Every spline function f(x) of degree r-1
based on the above sequence is uniquely represented
by a linear combination of the B-spline
basis. Thus, for $x \in [t_j, t_{j+1}]$, for some
coefficients $a_i$,

$$f(x) = \sum_{i=j-r+1}^{j} a_i B_i(x). \qquad (8)$$

Furthermore, the linear span is strictly convex.

TABLE 1. EXPRESSIONS FOR QUADRATIC B-SPLINES, THEIR DERIVATIVES AND INTEGRALS


Properties (i) through (iv) are direct consequences
of the definition of B-splines and/or
Peano's theorem. Properties (v) and (vi) were
shown by Curry and Schoenberg (1966). All these
properties can be verified from the explicit
formulas for the quadratic B-splines and their
integrals given in Table 1. There are other
mathematical properties most of which are
derivable using the total positivity property of
B-splines due to Karlin (1968). As more properties
are uncovered about the $B_i$ basis splines,
their importance in both theoretical and applied
works will become evident.
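
Properties (ii) and (iv) can be checked numerically. The sketch below is only an illustration under stated assumptions: it evaluates the B-splines with the Cox-de Boor recursion rather than the divided-difference definition (1), which yields the same normalized basis; the helper names (`bspline`, `clamped_knots`, `area`), the 0-based list index `p` (the text's index i corresponds to p = i + r - 2), and the small agespan n = 6 are all choices made here for illustration.

```python
def bspline(p, r, knots, x):
    """Normalized B-spline of order r on knots[p..p+r], via the Cox-de Boor recursion."""
    if r == 1:
        return 1.0 if knots[p] <= x < knots[p + 1] else 0.0
    left = right = 0.0
    # Terms with a zero-width knot span are dropped (multiple knots).
    if knots[p + r - 1] > knots[p]:
        left = (x - knots[p]) / (knots[p + r - 1] - knots[p]) * bspline(p, r - 1, knots, x)
    if knots[p + r] > knots[p + 1]:
        right = (knots[p + r] - x) / (knots[p + r] - knots[p + 1]) * bspline(p + 1, r - 1, knots, x)
    return left + right


def clamped_knots(n, r):
    """One knot at each integral age 1..n, with r knots at each boundary."""
    return [1.0] * (r - 1) + [float(a) for a in range(1, n + 1)] + [float(n)] * (r - 1)


def area(p, r, knots, steps=3000):
    """Midpoint-rule approximation of the area under the p-th B-spline."""
    a, b = knots[p], knots[p + r]
    h = (b - a) / steps
    return h * sum(bspline(p, r, knots, a + (k + 0.5) * h) for k in range(steps))


r, n = 3, 6                      # quadratic splines on a short, hypothetical agespan
knots = clamped_knots(n, r)
num_basis = len(knots) - r       # equals n + r - 2

# Property (iv): the basis sums to one at any age in [1, n).
sums = [sum(bspline(p, r, knots, x) for p in range(num_basis)) for x in (1.0, 1.5, 3.9)]

# Property (ii): area under each B-spline versus (t_{i+r} - t_i) / r.
areas = [(area(p, r, knots), (knots[p + r] - knots[p]) / r) for p in range(num_basis)]
```

With these settings the interior B-splines have unit area, while the boundary ones have areas 1/3 and 2/3 because of the repeated knots.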


3. ESTIMATION OF HAZARD FUNCTION, SURVIVAL
FUNCTION AND DEATH DENSITY FUNCTION


We shall represent the hazard function h(x)
over the agespan $[t_1, t_n]$ by a spline function and,
in accordance with equation (8), express it as
a linear combination of the B-splines with the
coefficients $a_i$ to be estimated from the observed
population and death data. Accordingly,
the estimate of h(x) on $[t_1, t_n]$ may be written
as

$$h(x) = \sum_{i=-r+2}^{n-1} a_i B_i(x). \qquad (9)$$

With quadratic splines, the explicit expressions
for $B_i(x)$ are given in Table 1.


Determination of the coefficients of the
B-splines in (9) requires information about
h(x). If we are given a set of $h(x_i)$ values
at some n-1 age points $x_i \in [t_i, t_{i+1})$, $i = 1, \dots, n-1$,
plus r-1 values of h(x) and its derivatives at
the two boundaries, say, then we may solve (9)
as a linear interpolation problem to obtain
estimates of the (n+r-2) coefficients
$a_i$. However, it is difficult to obtain
accurate values of h(x) at so many age points,
much more so with its derivatives. This
approach is, therefore, not feasible. Even
though we do not know h(x), the population
and death data do provide us with certain
average values of h(x) over relatively
small age intervals. These are weighted
averages of h(x) with age distribution as
weights and are known as age-specific
death rates (see (10) below). Computationally,
the death rate over the age interval
[x, x+y], denoted by ${}_yM_x$, is the ratio of
the number of deaths to the person-years of
exposure in the age interval [x, x+y], the
numerator coming from death (age-of-death)
data while the denominator from population
(age-still-alive) data. We shall now
derive our estimates of the $a_i$'s in terms
of the ${}_yM_x$'s with specific choices of x
and y and show that they are indeed good
estimates.

For every sequence of r+1 knots, $t_i,
t_{i+1}, \dots, t_{i+r}$, for $i = -r+2, \dots, n-1$,
define the truncated mean $\bar t_i = (t_{i+1} + \dots + t_{i+r-1})/(r-1)$
and the symmetrical range $2d_i$, where
$d_i = \min(\bar t_i - t_i,\ t_{i+r} - \bar t_i)$ is the minimum length
of age intervals from the truncated mean $\bar t_i$ to
the two extreme knots. Then, the death rate
over the age interval $[\bar t_i - d_i, \bar t_i + d_i]$ associated
with these r+1 knots is given as the weighted
average

$${}_{2d_i}M_{\bar t_i - d_i} = \frac{\int_{\bar t_i - d_i}^{\bar t_i + d_i} h(x)\,p(x)\,dx}{\int_{\bar t_i - d_i}^{\bar t_i + d_i} p(x)\,dx}, \qquad (10)$$

where p(x) is the population profile
function. The numerator in (10) represents
the number of deaths per unit time and the
denominator the number of persons, both for
the age interval $[\bar t_i - d_i, \bar t_i + d_i]$. It follows
from (10) that if the hazard function is
constant over $[\bar t_i - d_i, \bar t_i + d_i]$, then
$h(x) = {}_{2d_i}M_{\bar t_i - d_i}$. This result implies the
validity (but not accuracy) of the conventional
method of estimation mentioned
in Section 1.

To derive an estimate for $a_i$, we assume
h(x) and p(x) to have derivatives up to
the second order and expand them around
$\bar t_i$ as follows:



$$h(x) = h(\bar t_i) + h'(\bar t_i)(x - \bar t_i) + O((x - \bar t_i)^2), \qquad (11)$$

$$p(x) = p(\bar t_i) + p'(\bar t_i)(x - \bar t_i) + O((x - \bar t_i)^2). \qquad (12)$$

Then, substitute (11) and (12) into (10) and
integrate out to obtain

$${}_{2d_i}M_{\bar t_i - d_i} = \frac{h(\bar t_i)\,p(\bar t_i) + O(d_i^2)}{p(\bar t_i) + O(d_i^2)} = h(\bar t_i) + O(d_i^2). \qquad (13)$$

On the other hand, following the arguments
used in the quasi-interpolant approximation
of de Boor and Fix (1973), one easily obtains
the following approximation for the B-spline
coefficients:

$$a_i = h(\bar t_i) + O(d_i^2). \qquad (14)$$

Substituting (13) in (14), we have

$$a_i = {}_{2d_i}M_{\bar t_i - d_i} + O(d_i^2). \qquad (15)$$
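
The second-order behaviour in (13) can be illustrated numerically. The following is a minimal sketch, assuming a Gompertz-type hazard and an exponentially declining population profile chosen purely for illustration (neither comes from the paper's data); `death_rate` is a hypothetical helper that evaluates the weighted average (10) by the midpoint rule.

```python
import math


def death_rate(h, p, center, d, steps=2000):
    """Population-weighted average of h over [center - d, center + d], as in (10)."""
    a = center - d
    w = 2.0 * d / steps
    xs = [a + (k + 0.5) * w for k in range(steps)]   # midpoint-rule nodes
    num = sum(h(x) * p(x) for x in xs)
    den = sum(p(x) for x in xs)
    return num / den


def h(x):
    # Hypothetical Gompertz-type hazard.
    return 1e-4 * math.exp(0.09 * x)


def p(x):
    # Hypothetical population profile.
    return 1000.0 * math.exp(-0.02 * x)


center = 60.0
half_widths = (2.0, 1.0, 0.5)
errors = [abs(death_rate(h, p, center, d) - h(center)) for d in half_widths]

# Halving d should divide the error by roughly four if the error is O(d^2).
ratios = [errors[0] / errors[1], errors[1] / errors[2]]
```

Under these assumptions both entries of `ratios` come out close to 4, which is what an error of order $d_i^2$ predicts.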


Notice that the order of approximation obtained
in (15) is independent of the order r of the
spline used as long as $r \ge 3$. Equation (15)
suggests that ${}_{2d_i}M_{\bar t_i - d_i}$ is a good estimator
for $a_i$ for small spacings between the knots
such as single-year uniformly-spaced knot
sequences. Substituting this estimate

$$\hat a_i = {}_{2d_i}M_{\bar t_i - d_i} \qquad (16)$$

into (9) yields

$$h(x) = \sum_{i=j-r+1}^{j} {}_{2d_i}M_{\bar t_i - d_i}\, B_i(x), \qquad (17)$$

for $x \in [t_j, t_{j+1})$, $j = 1, 2, \dots, n-1$, since, by
Property (iii) of Section 2, each $B_i$ has a
support covering only r consecutive age intervals.
Formula (17) is the estimator of the hazard
function we propose. From Property (i) of
Section 2, h is of continuous class $C^{r-2}$
and its integral of continuous class
$C^{r-1}$ over $(t_1, t_n)$. Because of (13), the
estimator (17) is sharper than Schoenberg's
shape-preserving or variation diminishing
approximation (Schoenberg 1967). It can
also be seen as a modification as well as
generalization of Breslow's (1974) estimator
for cohort analysis which employs constant
splines over random death times. If h(x)
is constant, then, in view of equation (6),
equation (17) reduces to $h(x) = {}_{2d_i}M_{\bar t_i - d_i}$,
so that the estimation is exact following
the arguments immediately below equation
(10).

For quadratic splines with single year
knot spacing and n=90, we have r=3 and $t_i = i$
for $i = 1, \dots, 90$, with the boundary knots
$t_{-1} = t_0 = t_1 = 1$ and $t_{90} = t_{91} = t_{92} = 90$,
so that the knot index runs over
$i = -1, 0, 1, \dots, 90, 91$ and 92. The area
under the $B_i(x)$ curve for each single
year age interval is: $\psi_{-1,2}(2) = \psi_{89,0}(90) = 1/3$,
$\psi_{0,1}(2) = \psi_{88,1}(90) = 1/2$, and
$\psi_{i,0}(i+1) = \psi_{i,2}(i+3) = 1/6$, $\psi_{i,1}(i+2) = 2/3$,
$i = 1, \dots, 87$. The estimators
of the $a_i$'s are obtained from the definition
of $\bar t_i$ and $d_i$ and equation (16) to be:
$\hat a_{-1} = h(1)$, $\hat a_0 = {}_1M_1$, $\hat a_i = {}_3M_i$, $i = 1, \dots, 87$,
$\hat a_{88} = {}_1M_{89}$, $\hat a_{89} = h(90)$. To obtain explicit
formulas for h(x), we substitute into (17)
the above estimates for the $a_i$'s and the
expressions for $B_i(x)$ for single year knot
spacing derived from Table 1. Integration
of h(x) then leads to estimates of the
survival function and density function.

Hence, the hazard function is estimated by

$$h(x) = h(1)(2-x)^2 - {}_1M_1(3x^2 - 10x + 7)/2 + {}_3M_1(x-1)^2/2, \quad 1 \le x < 2;$$

$$h(x) = {}_1M_1(3-x)^2/2 - {}_3M_1(2x^2 - 10x + 11)/2 + {}_3M_2(x-2)^2/2, \quad 2 \le x < 3;$$

$$h(x) = {}_3M_{j-2}(j+1-x)^2/2 - {}_3M_{j-1}[2x^2 - 2(2j+1)x + 2(j-1)(j+2) + 3]/2 + {}_3M_j(x-j)^2/2, \quad j \le x < j+1,\ j = 3, \dots, 87;$$

$$h(x) = {}_3M_{86}(89-x)^2/2 + {}_3M_{87}\{1 - [(x-88)^2 + (89-x)^2]/2\} + {}_1M_{89}(x-88)^2/2, \quad 88 \le x < 89;$$

$$h(x) = {}_3M_{87}(90-x)^2/2 + {}_1M_{89}[1 - (x-89)^2 - (90-x)^2/2] + h(90)(x-89)^2, \quad 89 \le x \le 90. \qquad (18)$$

The survival function is estimated by

$$\ln F(x) = \ln F(1) - \int_1^x h(y)\,dy, \qquad (19)$$

where $\int_1^x h(y)\,dy = \sum_{i=1}^{j-1} H_i(i+1) + H_j(x)$, $j = 1, \dots, 89$,
and $x \in [j, j+1]$, and $H_j(x) = \int_j^x h(y)\,dy$
is given by

$$H_1(x) = h(1)[(x-2)^3 + 1]/3 + {}_1M_1(3-x)(x-1)^2/2 + {}_3M_1(x-1)^3/6,$$

$$H_2(x) = {}_1M_1[(x-3)^3 + 1]/6 + {}_3M_1(2-x)(2x^2 - 11x + 11)/6 + {}_3M_2(x-2)^3/6,$$

$$H_j(x) = {}_3M_{j-2}\,\psi_{j-2,2}(x) + {}_3M_{j-1}\,\psi_{j-1,1}(x) + {}_3M_j\,\psi_{j,0}(x), \quad j = 3, \dots, 87;$$

$$H_{88}(x) = {}_3M_{86}[1 - (89-x)^3]/6 + {}_3M_{87}\{[(88-x)^3 + (89-x)^3 - 1]/6 - (88-x)\} + {}_1M_{89}(x-88)^3/6,$$

$$H_{89}(x) = {}_3M_{87}[1 - (90-x)^3]/6 + {}_1M_{89}\{x - 89 - (x-89)^3/3 - [(x-90)^3 + 1]/6\} + h(90)(x-89)^3/3. \qquad (20)$$

To compute with the above formulas, three
values h(1), h(90) and F(1) are still
needed. The values of h(1) and F(1) are
obtained from the method described in
Hsieh (1985) and h(90) is obtained by
fitting a Gompertz curve through the last
three age-specific death rates. An estimate
of the death density function f(x) is
obtained by substituting (18) and (19)
into the following formula:

$$f(x) = h(x)\,F(x). \qquad (21)$$
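
As an illustration of how (18) through (21) fit together, here is a minimal sketch. The function names, the dictionary `m3` holding the three-year rates ${}_3M_i$, and every numerical value are assumptions made for the example; in particular the rates are synthetic and are not the 1981 Canadian data. Only the first-year cumulative hazard $H_1$ of (20) is coded in closed form.

```python
import math


def hazard(x, h1, h90, m1_1, m3, m1_89):
    """Piecewise-quadratic hazard of (18) for 1 <= x <= 90.

    m3[i] is the three-year rate 3M_i (i = 1..87); m1_1 and m1_89 are the
    one-year rates 1M_1 and 1M_89; h1 and h90 are the boundary values h(1), h(90).
    """
    if not 1.0 <= x <= 90.0:
        raise ValueError("x must lie in [1, 90]")
    if x < 2.0:
        return (h1 * (2 - x) ** 2
                - m1_1 * (3 * x * x - 10 * x + 7) / 2
                + m3[1] * (x - 1) ** 2 / 2)
    if x < 3.0:
        return (m1_1 * (3 - x) ** 2 / 2
                - m3[1] * (2 * x * x - 10 * x + 11) / 2
                + m3[2] * (x - 2) ** 2 / 2)
    if x < 88.0:
        j = int(x)  # age interval [j, j+1) containing x
        mid = 2 * x * x - 2 * (2 * j + 1) * x + 2 * (j - 1) * (j + 2) + 3
        return (m3[j - 2] * (j + 1 - x) ** 2 / 2
                - m3[j - 1] * mid / 2
                + m3[j] * (x - j) ** 2 / 2)
    if x < 89.0:
        return (m3[86] * (89 - x) ** 2 / 2
                + m3[87] * (1 - ((x - 88) ** 2 + (89 - x) ** 2) / 2)
                + m1_89 * (x - 88) ** 2 / 2)
    return (m3[87] * (90 - x) ** 2 / 2
            + m1_89 * (1 - (x - 89) ** 2 - (90 - x) ** 2 / 2)
            + h90 * (x - 89) ** 2)


def cum_hazard_age_1_to_2(x, h1, m1_1, m3_1):
    """H_1(x) of (20): integral of the hazard from age 1 to x, for 1 <= x <= 2."""
    return (h1 * ((x - 2) ** 3 + 1) / 3
            + m1_1 * (3 - x) * (x - 1) ** 2 / 2
            + m3_1 * (x - 1) ** 3 / 6)


# Synthetic inputs, for illustration only.
m3 = {i: 5e-4 * math.exp(0.08 * i) for i in range(1, 88)}
h1, m1_1 = 1.2e-3, 8e-4
m1_89, h90 = 0.20, 0.22

x = 1.6
big_h = cum_hazard_age_1_to_2(x, h1, m1_1, m3[1])
survival_ratio = math.exp(-big_h)                                      # F(x) / F(1), from (19)
density_ratio = hazard(x, h1, h90, m1_1, m3, m1_89) * survival_ratio   # f(x) / F(1), from (21)
```

Two quick sanity checks follow from the text: with all rates set to the same constant the hazard is that constant at every age, and the pieces of (18) agree at each integral age, so the estimate is continuous.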

Because of age heaping, reporting and
other errors, population data are normally
published in five-year age groupings. To
compute the death rates ${}_3M_i$, $i = 2, \dots, 87$,
and ${}_1M_{89}$, we interpolate the five-year cumulated
population to the exact integral ages using
a complete cubic spline function. Subtraction
then yields the required population for the
denominator of the death rates.

In  conclusion,  by  using  quadratic  B-splines, 
we  have  considerably  improved  the  accuracy  of 
the  hazard  function  estimate  over  the  conven¬ 
tional  method  in  which  zero-degree  B-splines 
are  in  effect  used. 


4. AN EXAMPLE

We have employed the tabulated registered
deaths and census populations for Canadian
males, 1981, to estimate the hazard function,
using (18), (19), (20) and (21). Death
rates are computed as the ratio of deaths
to the mid-year population in the indicated
age intervals. To obtain the population for
these age intervals, the tabulated populations
in five-year age groupings are cumulated
backward and then interpolated to exact
integral ages by complete spline interpolation,
using either the procedure given
in Ahlberg et al. (1967) or that given in
Schoenberg (1973). The initial endslope
for  the  complete  spline  is  estimated  by 
subtracting  the  number  of  deaths  under  age 
one  from  the  births  and  the  final  endslope 
by  adding  one-half  the  deaths  in  the  last 
age  interval  to  the  population  in  that  age 
interval.  The  estimates  of  the  three 
functions  h(x) ,  F(x)  and  f(x)  are  shown  in 
Figures  1,  2  and  3.  Figure  la  magnifies 

h(x)  for  ages  from  1  to  3  years.  A 
comparison  with  the  other  two  spline 
methods  of  estimating  the  hazard  function 
and  its  derived  functions,  namely,  those 
of  Hsieh  (1979)  and  Okamoto  (1979),  shows 
practically  no  differences  in  the  results 
and  the  graphs  obtained  from  the  latter 
two  methods  are  virtually  indistinguishable 
from  those  shown  in  Figures  1,  la,  2  and  3. 

The  principal  advantage  of  the  present  method 
lies  in  the  fact  that  the  hazard  function  can 
be  integrated  out  exactly,  resulting  in  a 
low  degree  polynomial  spline. 


A LEISURELY LOOK AT THE MODELS OF UNCERTAINTY IN EXPERT SYSTEMS


Syni-An  Hwang,  SUNY  Albany 


§1. Introduction

What  is  an  Expert  System?  Broadly 
speaking,  we  say  that  an  Expert  System 
is  a  computer  program  that  uses 
explicitly  represented  knowledge  and 
computational  inference  procedures  to 
solve  problems  normally  thought  to 
require  human  expertise.  More  precisely, 
its  purpose  is  to  obtain  the  knowledge 
of  experts  in  a  particular  domain, 
represent  it  in  an  expandable  knowledge 
base,  and  transfer  it  to  users  for 
solving  other  problems  in  the  same 
problem  domain. 

Currently, there is a great deal of
interest in introducing uncertainty into
an Expert System. Given a simple rule,
"If E, then H", the expert actually
expresses the rule as "If E, then H
with p", and the user actually provides the
information as "E is true with $p_E$". Here p
is a measure of strength, with higher
strength indicating greater power of the
evidence to confirm the hypothesis.

In  this  paper,  we  will  study  some  of 
the  best-known  approaches  to  model  the 
uncertainty  in  an  Expert  System.  There 
have  been  papers  devoted  to  similar 
topics,  Bonissone  (1982),  and  Black  and 
Eddy  (1985).  One  of  the  aims  of  this 
paper  is  to  discuss  the  points  raised  by 
previous  papers. 

§2.  Bayesian  Probabilistic  Models 

In the Bayesian probability model, it
is assumed that probability measures the
degree of belief. Let P(H) denote the
prior belief in H. When new evidence E
is obtained, the posterior, P(H|E),
denotes the revised belief in H upon learning
that E is true. Bayes' rule can be
expressed in odds-likelihood form as:

$$O(H|E) = LR(E|H) \times O(H) \qquad (2.1)$$

where O(H), O(H|E), and LR(E|H) are the
prior odds, posterior odds, and likelihood
ratio. To implement such a rule,
the expert provides the likelihood ratio
and the prior odds. Then, by eq. (2.1),
the system updates the prior odds to the
posterior odds.

There exist potential problems of
incoherency in this approach. For
example, in theory, $LR(\bar E|H)$ and $LR(E|H)$,
with $LR(E|H) = P(E|H)/P(E|\bar H)$, satisfy

$$LR(\bar E|H) = \frac{1 - LR(E|H) \times P(E|\bar H)}{1 - P(E|\bar H)}. \qquad (2.2)$$

In practice, the elicited probabilities
provided by the expert often violate
eq. (2.2). Biases do exist even among
well-trained experts (Kahneman, Slovic
and Tversky (1982), and Shafer and
Tversky (1984)).
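
The update (2.1) and the coherence condition (2.2) can be made concrete with a small sketch. The function names and all probabilities below are hypothetical, chosen only to show the arithmetic; the "elicited" value stands in for a number an expert might supply independently.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def prob(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)


def posterior_odds(prior_odds, likelihood_ratio):
    """Odds-likelihood form of Bayes' rule, eq. (2.1)."""
    return likelihood_ratio * prior_odds


def implied_lr_not_e(lr_e, p_e_given_not_h):
    """Likelihood ratio for 'E is false' implied by eq. (2.2)."""
    return (1.0 - lr_e * p_e_given_not_h) / (1.0 - p_e_given_not_h)


# Hypothetical numbers.
p_h, p_e_given_h, p_e_given_not_h = 0.2, 0.9, 0.3
lr_e = p_e_given_h / p_e_given_not_h                    # about 3
p_h_given_e = prob(posterior_odds(odds(p_h), lr_e))     # about 0.4286

# An expert who supplies LR for 'E false' separately may be incoherent.
elicited_lr_not_e = 0.5
coherent_lr_not_e = implied_lr_not_e(lr_e, p_e_given_not_h)   # about 0.1429
is_coherent = abs(elicited_lr_not_e - coherent_lr_not_e) < 1e-9
```

Here the elicited value 0.5 disagrees with the value that (2.2) forces, which is the kind of incoherency the text describes.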


Another type of incoherency could exist:
a difference between the uncertainty of the
user and that of the expert. When
implementing an Expert System, the user
makes relevant observations O and provides
$P_u(E|O)$, the user's probability that E
is true given O. Previous approaches
make the implicit assumption that, given O,
$P_u(E|O) = P(E|O)$. In practice, it is
unlikely that the user and the expert
will be coherent with each other.

There are other problems in applying
a Bayesian probabilistic model to the
Expert System, for example, the independence
assumptions needed for reducing the
computing complexity. Suppose we have n
rules which say: If $E_i$, then H with
probability $P_i$, $i = 1, \dots, n$. We can
construct a combined rule which says: If
E, then H with probability P, where
$E = E_1 \cap E_2 \cap \dots \cap E_n$ and P = P(H|E).
Theoretically, we can compute P by Bayes'
rule. However, in practice the likelihood
ratio $LR(E_1 \cap E_2 \cap \dots \cap E_n | H)$ needed in
Bayes' rule is not provided in the Expert
System. Therefore, we need the
likelihood ratios for $\bigcap_{i=1}^{k} E_i$, $k = 1, \dots, n$,
given H. This makes the model extremely
complicated and hence inapplicable.

To simplify the computation, two
conditional independence assumptions have
been made,

$$P(E_i \cap E_j | H) = P(E_i | H) \times P(E_j | H) \qquad (2.3)$$

$$P(E_i \cap E_j | \bar H) = P(E_i | \bar H) \times P(E_j | \bar H) \qquad (2.4)$$

Under these assumptions, we can compute
the posterior odds as

$$O(H | E_1 \cap E_2 \cap \dots \cap E_n) = \Big( \prod_{i=1}^{n} LR(E_i | H) \Big) \times O(H) \qquad (2.5)$$
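
A short sketch shows how (2.5) agrees with a direct Bayes computation when the conditional independence assumptions hold. All probabilities and names below are hypothetical and serve only as a worked example.

```python
import math


def combined_posterior_odds(prior_odds, likelihood_ratios):
    """Posterior odds under conditional independence, as in eq. (2.5)."""
    return prior_odds * math.prod(likelihood_ratios)


# Hypothetical inputs for three pieces of evidence.
p_h = 0.3
p_e_given_h = [0.8, 0.6, 0.7]
p_e_given_not_h = [0.2, 0.5, 0.1]

likelihood_ratios = [a / b for a, b in zip(p_e_given_h, p_e_given_not_h)]
via_formula = combined_posterior_odds(p_h / (1 - p_h), likelihood_ratios)

# Direct computation: joint probabilities of H (or not H) with all E_i true,
# built from the conditional independence assumptions (2.3) and (2.4).
joint_h = p_h * math.prod(p_e_given_h)
joint_not_h = (1 - p_h) * math.prod(p_e_given_not_h)
direct = joint_h / joint_not_h
```

Both routes give posterior odds of about 14.4 for these numbers. The saving is that n individual likelihood ratios suffice, instead of one for every conjunction of evidence.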

The assumptions in eq. (2.3) and (2.4)
were adopted in the Expert System
PROSPECTOR and were extensively
criticized. One of the most notable
papers is that of Pednault, Zucker and
Muresan (1981). Using a result from
Hussian (1972), they claim that under
the above assumptions, no updating can
take place. This result has been
blunted by Glymour (1985), who points
out an algebraic error in Hussian's
derivation and hence invalidates the
result of Pednault et al.

Overall, researchers on Expert
Systems seem to understand that we need
some independence assumptions to reduce
the computing load. They also agree
that the independence assumptions are
unrealistic. To reconcile these
conflicts, their attitude is to try to
make the assumption of independence as
realistic as possible (through the
design of the Expert System). Hence,
we can at least approximate the ideal
Bayesian probabilistic model.

There  has  been  a  different  approach 
to  problem-solving  systems  generated 
mainly  by  statisticians.  The  most 
notable  program  is  Kadane,  et  al. 

(1980).  The  main  difference  between 
this  approach  and  the  rule-based  Expert 
System  is  in  the  assumption  of  the  under¬ 
lying  statistical  model  and  the 
existence  of  a  prior  to  represent  the 
knowledge  of  the  expert. 

Philosophically,  there  are  different 
attitudes  toward  the  statistically- 
based  and  rule-based  Expert  System. 

That  is,  does  one  want  an  increasingly 
large  (due  to  addition  of  the  new  rule) 
and  essentially  deterministic  rule- 
based  Expert  System,  or  a  concise  and 
probabilistic  statistically-based 
Expert  System?  A  formal  comparison 
between  the  statistically-based  and  rule- 
based  Expert  System  should  be  attempted. 

In summary, the Bayesian probabilistic
model has a concrete theoretical
foundation. However, such practical
problems as computational burdens and
incoherent probability assessment make
it less applicable in Expert Systems.

§3. Certainty Factors

The certainty factors approach
originates from Carnap's confirmation
theory (Carnap (1950)).
Instead of saying, E implies H or E
refutes H, the probability expresses
the degree of implication of H
afforded by E. According to Carnap's
concept, "probability is much like
personal probability, except that here
it is argued or postulated that there is
one and only one opinion justified by
any body of evidence, so that probability
is an objective logical
relationship between an event A and the
evidence B" (Savage (1961)).

The  design  of  the  MYCIN  system 
(Shortliffe  and  Buchanan  (1975))  is  an 
implementation  of  Carnap's  concept. 

They define a measure of belief (MB)
and a measure of disbelief (MD) as the
proportional increase (decrease) from
P(H) to P(H|E) relative to what is
possible. By definition, we can prove
that $MD(H,E)$ is equal to $MB(\bar H, E)$. The
overall certainty factor (CF) is
defined as

$$CF(H,E) = MB(H,E) - MD(H,E) \qquad (3.1)$$
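
The measures can be sketched in code. The explicit formulas used below are an assumption: they are the probabilistic definitions commonly attributed to Shortliffe and Buchanan (increase or decrease in belief relative to the maximum possible change), which the text above describes only in words. The function names and the probabilities are hypothetical.

```python
def measure_of_belief(p_h, p_h_given_e):
    """MB(H,E): increase from P(H) to P(H|E) relative to the possible increase 1 - P(H)."""
    if p_h == 1.0:
        return 1.0
    return (max(p_h_given_e, p_h) - p_h) / (1.0 - p_h)


def measure_of_disbelief(p_h, p_h_given_e):
    """MD(H,E): decrease from P(H) to P(H|E) relative to the possible decrease P(H)."""
    if p_h == 0.0:
        return 1.0
    return (p_h - min(p_h_given_e, p_h)) / p_h


def certainty_factor(p_h, p_h_given_e):
    """CF(H,E) = MB(H,E) - MD(H,E), eq. (3.1)."""
    return measure_of_belief(p_h, p_h_given_e) - measure_of_disbelief(p_h, p_h_given_e)


# Evidence that raises belief in H from 0.2 to 0.6.
cf_confirming = certainty_factor(0.2, 0.6)       # about 0.5

# Evidence that lowers belief in H from 0.2 to 0.05.
cf_disconfirming = certainty_factor(0.2, 0.05)   # about -0.75

# Disbelief in H equals belief in its complement, given the same evidence.
md_h = measure_of_disbelief(0.2, 0.05)
mb_not_h = measure_of_belief(1 - 0.2, 1 - 0.05)
```

With these definitions, at most one of MB and MD is nonzero for a given piece of evidence, so CF lies between -1 and 1.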

Four combining functions have been
used in MYCIN and have drawn a lot of
criticism. Adams (1976) has shown that
the first combining function (incrementally
acquired evidence) implicitly assumes
the independence of the evidence, a
questionable assumption as we stated in
section 2. The second and third combining
functions are the controversial minimum
and maximum rules borrowed from fuzzy
set theory. To apply the fourth combining
function (strength of evidence) we have
to assume coherency between the user
and the expert, a property that is
doubtful in the real world.

In conclusion, we can view the
certainty factors model as a Bayesian
model with some ad hoc combining of rules.
From a theoretical viewpoint, it does not
appear useful, but it does have value from
a practical point of view. Perhaps this
could be best expressed by quoting
Shortliffe and Buchanan: "The justification
of our approach therefore rests not
with a claim of improving on Bayes'
theorem but rather with the development
of a mechanism whereby judgmental
knowledge can be efficiently represented
and utilized for the modeling of medical
decision making,..."

§4.  Belief  Function 

Shafer's belief function originates
from Dempster's upper and lower
probabilities (e.g. Dempster (1967)).
Assume there is a set $\Theta$ (the frame of
discernment) of n mutually exclusive and
exhaustive propositions, $A_1, A_2, \ldots, A_n$.
Shafer assigns probability mass on the
power set of $\Theta$ according to a basic
probability function $m(\cdot)$, $m: 2^{\Theta} \to [0,1]$.
A subset A of a frame $\Theta$ is called
a focal element if $m(A) > 0$. The belief
function $\mathrm{Bel}: 2^{\Theta} \to [0,1]$ is defined,
on A, a subset of $\Theta$, as the sum of m(B)
over all subsets B of A. Note there is
a 1-1 relation between $\mathrm{Bel}(\cdot)$ and $m(\cdot)$
given by

$$m(A) = \sum_{B \subseteq A} (-1)^{|A-B|}\, \mathrm{Bel}(B) \tag{4.1}$$

where $|A-B|$ is the cardinality of A-B.

The plausibility, $P^*(A)$, is defined as
$1 - \mathrm{Bel}(\bar{A})$, where $\bar{A}$ is the complement
of A. By definition, $\mathrm{Bel}(A) \le P^*(A)$.
When $\mathrm{Bel}(A) = P^*(A)$ for
all subsets in $2^{\Theta}$, the belief function
reduces to the conventional probability.
This occurs only if $m(\cdot)$ distributes all
the mass on the singletons $A_1, \ldots, A_n$.
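The definitions above can be checked numerically on a small frame. The following is a minimal sketch, assuming a hypothetical three-element frame and an illustrative mass assignment (neither comes from the paper); the function names are likewise only illustrative. It computes Bel as the sum of m over subsets, the plausibility as one minus the belief in the complement, and recovers m from Bel by the inversion formula (4.1).

```python
from itertools import combinations


def powerset(frame):
    """All subsets of `frame`, as frozensets, including the empty set."""
    items = sorted(frame)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def belief(m, a):
    """Bel(A): sum of m(B) over all focal elements B contained in A."""
    return sum(mass for b, mass in m.items() if b <= a)


def plausibility(m, a, frame):
    """P*(A) = 1 - Bel(complement of A)."""
    return 1.0 - belief(m, frozenset(frame) - a)


def mass_from_belief(bel, a):
    """Eq. (4.1): m(A) = sum over B in A of (-1)^|A-B| * Bel(B)."""
    return sum((-1) ** len(a - b) * bel[b] for b in powerset(a))


# Hypothetical frame and mass assignment, for illustration only.
frame = {"a1", "a2", "a3"}
m = {frozenset({"a1"}): 0.5,
     frozenset({"a1", "a2"}): 0.3,
     frozenset(frame): 0.2}

bel = {s: belief(m, s) for s in powerset(frame)}
a = frozenset({"a1", "a2"})
print("Bel:", bel[a], "P*:", plausibility(m, a, frame))
print("m recovered from Bel:", mass_from_belief(bel, a))
```

Because the sketch enumerates the whole power set, its cost grows as 2^n in the size of the frame, which is the computational burden discussed later in this section.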

Dempster's rule of combination is
used to combine belief
functions. Let $m_1$, $m_2$ be two basic
probability assignments over the same
frame $\Theta$, with focal elements $A_1, \ldots, A_k$
and $B_1, \ldots, B_l$, respectively. If the
normalizing factor

$$K = 1 - \sum_{A_i \cap B_j = \emptyset} m_1(A_i)\, m_2(B_j) > 0,$$

then Dempster's rule defines their orthogonal
sum as

$$m_{12}(A) = \Big( \sum_{A_i \cap B_j = A} m_1(A_i)\, m_2(B_j) \Big) \Big/ K \tag{4.2}$$

for all nonempty subsets A of $\Theta$, and
$m_{12}(\emptyset) = 0$.
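As an illustration of the orthogonal sum, here is a minimal sketch in Python. The function name `dempster_combine`, the frame, and the two mass assignments are assumptions made for this example only; focal elements are represented as frozensets. Products of masses are accumulated on each nonempty intersection, the mass falling on empty intersections is the conflict, and the result is divided by K = 1 - conflict.

```python
def dempster_combine(m1, m2):
    """Orthogonal sum of two basic probability assignments.

    m1 and m2 map frozenset focal elements to masses that sum to 1.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass on the empty set
    k = 1.0 - conflict  # normalizing factor K
    if k <= 0.0:
        raise ValueError("total contradiction: K = 0, the rule is undefined")
    return {a: mass / k for a, mass in combined.items()}


# Hypothetical beliefs of a user and an expert over one frame.
frame = frozenset({"h1", "h2", "h3"})
m_user = {frozenset({"h1"}): 0.6, frame: 0.4}
m_expert = {frozenset({"h1", "h2"}): 0.7, frame: 0.3}

m12 = dempster_combine(m_user, m_expert)
for focal, mass in sorted(m12.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(focal), round(mass, 4))
```

In this example no pair of focal elements is disjoint, so K = 1 and the combined masses are simply the accumulated products.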

To implement belief functions in a
rule-based expert system, the user
provides his belief in the evidence, and
the expert provides his belief in each
rule in the system. Dempster's rule of
combination is then used to combine them.

There are several advantages in using
the belief function in an expert system.

1. Ignorance: When an expert (or user) has
complete confidence he can express his
opinion as a probability. But when he is
unable to commit all of his belief, he can
choose to ignore the noncommittal part.

2. Ability to handle conflicting evidence:
Except in the case of total contradiction,
Dempster's rule of combination
provides a way to combine the expert's and
the user's beliefs even when their beliefs are
incoherent.

However, questions have been raised
about the implementation of the belief
function.

1. Computational problem: The time needed
to evaluate the degree of belief is
exponential in the cardinality of the
proposition set, so real-time calculation
in real-time situations is not possible
on today's computers.

2. Normalization process: The normalization
process used in Dempster's rule can
lead to incorrect results. An
instructive example has been given by
Zadeh (1984) to show that the
normalization process can produce
counter-intuitive results when dealing
with conflicting evidence.
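The effect of normalization under strong conflict can be sketched as follows. The two mass assignments are illustrative assumptions in the spirit of the example attributed to Zadeh (1984), not figures quoted from that paper, and `dempster_combine` is a hypothetical helper written for this sketch. Two sources each give almost all of their belief to a different hypothesis and only a tiny amount to a third one that they share.

```python
def dempster_combine(m1, m2):
    """Dempster's rule for masses keyed by frozenset focal elements."""
    combined, conflict = {}, 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b
    k = 1.0 - conflict
    if k <= 0.0:
        raise ValueError("total contradiction")
    return {a: mass / k for a, mass in combined.items()}, k


# Illustrative masses: the sources disagree almost completely.
source_1 = {frozenset({"meningitis"}): 0.99, frozenset({"tumor"}): 0.01}
source_2 = {frozenset({"concussion"}): 0.99, frozenset({"tumor"}): 0.01}

combined, k = dempster_combine(source_1, source_2)
print("normalizing factor K:", k)
print("combined mass on tumor:", combined[frozenset({"tumor"})])
```

Nearly all of the product mass falls on empty intersections, so K is close to zero and, after division by K, the hypothesis that both sources considered very unlikely receives essentially all of the combined belief.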

In summary, in theory, the belief
function model has a solid foundation
but lacks empirical support. In
practice, excessive computing time makes
the belief function model almost
inapplicable at present.

§5.  Possibility  Theory 

Possibility theory originates from
Zadeh's fuzzy set theory. The formal
definition and basic operators of fuzzy
set theory were given in Zadeh's 1965
paper. Zadeh argued that probability
theory may be appropriate for problems
involving the measure of information. It
is inappropriate, however, for problems
concerning the meaning of information. To
overcome the problems caused by fuzziness
(vagueness) of definition, Zadeh proposed
fuzzy set theory, which provides a
formalism for treating such vagueness.
Zadeh introduced a new term, the
"membership function", which
can be interpreted as a measure of
fuzziness for inclusion of an object in a
set. Let X be the space of points (or
objects) of interest, say, X = {x}. A
fuzzy set A in X is characterized by a
membership function $f_A(\cdot)$, with $f_A(x)$
representing the grade of membership of x
in A. The basic assumptions given by
Zadeh can be expressed as follows:


(i) $$0 \le f_A(x) \le 1 \tag{5.1}$$

(ii) $$f_{A \cup B}(x) = \max\{f_A(x),\, f_B(x)\} \tag{5.2}$$

(iii) $$f_{A \cap B}(x) = \min\{f_A(x),\, f_B(x)\} \tag{5.3}$$

(iv) $$f_{\bar{A}}(x) = 1 - f_A(x) \tag{5.4}$$

A, B are subsets of X; $\bar{A}$ is the complement of A.


The assumption in eq. (5.1) is not
necessary, but it is convenient. When A
is completely specified, $f_A(x)$ takes only
the value 1 or 0,
according to whether x does or does not
belong to A. Thus $f_A(x)$ reduces to the
ordinary indicator function of a set A.

The second and third assumptions are
the maximum and minimum rules for
disjunction and conjunction. The last
assumption is the complement rule. The
example below, given by Black and Eddy
(1985), shows the shortcomings of the
above rules. Applying rules
(5.2), (5.3) and (5.4) to A and $\bar{A}$ we have

$$f_{A \cup \bar{A}}(x) = \max\{f_A(x),\, 1 - f_A(x)\} \tag{5.5}$$

and

$$f_{A \cap \bar{A}}(x) = \min\{f_A(x),\, 1 - f_A(x)\} \tag{5.6}$$

Since $A \cup \bar{A} = X$ and $A \cap \bar{A} = \emptyset$,
the left hand sides of eq. (5.5) and (5.6)
should equal 1 and 0,
respectively, but the right hand sides of
the equations do not unless $f_A(x)$ is 0 or 1.
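A short numerical check makes the point concrete. This is a minimal sketch with hypothetical membership grades; the three helper functions simply encode the maximum, minimum, and complement rules, so that a set is combined with its own complement.

```python
def f_union(fa, fb):
    """Maximum rule for the union (disjunction)."""
    return max(fa, fb)


def f_intersection(fa, fb):
    """Minimum rule for the intersection (conjunction)."""
    return min(fa, fb)


def f_complement(fa):
    """Complement rule."""
    return 1.0 - fa


# Hypothetical grades of membership of some x in A.
for grade in (0.0, 0.3, 0.5, 1.0):
    comp = f_complement(grade)
    print(f"f_A = {grade:.1f}:",
          "A or not-A ->", f_union(grade, comp),
          "| A and not-A ->", f_intersection(grade, comp))
```

Only the crisp grades 0 and 1 give 1 for the union and 0 for the intersection; any intermediate grade gives a union below 1 and an intersection above 0.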

There are other critiques of fuzzy
set theory. French (1984) argued against
fuzzy set theory from the philosophical
point of view. Two key points raised by
French are: (i) Why should we believe
(or assume) that the fuzziness in our
perception is well (or precisely) modeled
by the abstract concept of a fuzzy model?
(ii) Since emphasizing imprecision does
not seem to help us to understand the model
better, why bother to bring in another
level of reasoning about fuzziness?

The other question often raised is,
"How can the grade of membership be
determined?" In all, fuzzy set theory
does not seem to contain a rational or an
empirical method for determining the value
of the membership function.

So far, we have introduced the foundations
of fuzzy set theory and mentioned
various critiques of it, but it must be
said that we may be expecting too much from a
field with only twenty years of history.
There has been much exciting research
surrounding the field of fuzzy set theory.
We will briefly mention some of it below.

A tremendous amount of effort has been
put forth to combine probability (or
belief functions) and fuzzy set theory.
Zadeh points out that the concepts of
possibility and necessity are the same as
the concepts of support and plausibility
in Shafer's belief function. Actually,
if we discard the normalizing factor in
the Dempster rule of combination (as
suggested by Zadeh), the theory of belief
functions is exactly the same as the theory
of possibility (Zadeh (1984)).

Another area of interest is the use of
fuzzy set theory in linguistic approximation
to the true qualification. Zadeh
suggests a fuzzy-set theoretic interpretation
of linguistic variables. That is to
say, if the assertion of a fact is not
known with precision, then it may be
characterized linguistically as, say, true,
not true, very true, etc., with each of
the linguistic expressions representing a
fuzzy subset of the unit interval. Zadeh
treated such linguistic fuzzy reasoning
as approximate reasoning.

It has been argued that it is more
appropriate to present conclusions in
natural language form than in numerical
form (e.g. Bonissone (1979)). Also, it
has been argued that people prefer to
express their beliefs linguistically,
rather than numerically. If the above
arguments are true, then it is very
natural to implement linguistic approximation
for fuzzy reasoning in an expert
system. However, there appears to be
very little psychological or theoretical
evidence to support the arguments made
above.

In summary, fuzzy set theory has a
sound theoretical foundation, but it lacks
normative justification as a belief function.
At the moment, it appears that the inclusion
of fuzzy logic in models of inexact
reasoning adds an unnecessary extra
complication.

§6.  Conclusion 

As we have seen, none of the above models
is better than the others in all
applications. Moreover, the previous
sections suggest that there is more than
one type of uncertainty. This
suggests that we should recognize different
types of uncertainty in different
situations and hence use different models for
different types of problems. Further
study of the feasibility of the multi-model
approach to uncertainty in
expert systems is needed. Other interesting
research topics can be found in
Hwang (1986).

BOOTSTRAPPING KOLMOGOROV-SMIRNOV STATISTICS, II
Alan  Julian  Izenman,  Temple  University 


Summary.  This  is  the  second  part  of  an 
empirical  investigation  into  bootstrapping  the 
two-sample  Kolmogorov-Smirnov  statistic  D(m,n). 
The  first  part  (ASA  Proc.  of  the  Statistical 
Computing  Section,  97-101,  1985)  dealt  with 
estimating the standard error of D(m,n) by bootstrap
methods and showed that the bootstrap
performs  very  well,  especially  when  compared  with 
standard  asymptotic  approximations.  In  this 
paper,  we  carry  out  an  empirical  study  of 
percentile  estimation  for  the  Kolmogorov-Smirnov 
statistic  using  the  bootstrap  procedure.  As 
noted  for  other  situations  in  which  percentiles 
are  estimated  by  the  bootstrap,  the  number  of 
bootstrap  replications  has  to  be  large  to  obtain 
reasonable  estimates,  and  even  then  those 
estimates  are  slightly  on  the  low  side.  We  also 
consider  a  logarithmic  transformation  of  D(m,n), 
which  has  been  suggested  in  the  literature  as 
having  an  approximately  normal  distribution  for 
large  m  and  n. 

1. Introduction. The usual nonparametric
two-sample problem can be stated in the following
manner. Given two independent random samples,

$$X_1, X_2, \ldots, X_m \sim F, \qquad Y_1, Y_2, \ldots, Y_n \sim G,$$

where F and G are both continuous, but unknown,
distribution functions, and $m \le n$, we are
interested in comparing F with G to see whether
they are in fact the same. The statistic that we
consider here is the classical two-sample
Kolmogorov-Smirnov distance,

$$D(m,n) = \sup_{x \in \mathbb{R}} |F_m(x) - G_n(x)|,$$

where $F_m$ and $G_n$ are the respective sample distribution
functions obtained by placing mass 1/m on
each $X_i$ (i = 1,2,...,m) and mass 1/n on each $Y_j$
(j = 1,2,...,n). Large values of D(m,n) suggest
evidence against

$$H_0: F = G,$$

while small values of D(m,n) favor $H_0$ over the
alternative, that $F \ne G$. Because of the global
nature of the alternative, the Kolmogorov-Smirnov
statistic has been criticised as not having high
sensitivity to detect specific types of departure
between the two distribution functions. However,
the statistic is used widely enough to warrant an
investigation as presented here.
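The two-sample statistic is simple to compute directly from its definition. The sketch below is an illustration, not the Fortran/IMSL code used in the study; the function name `ks_two_sample` is an assumption. It uses the fact that both sample distribution functions are right-continuous step functions, so the supremum of their absolute difference is attained at one of the observed points.

```python
from bisect import bisect_right


def ks_two_sample(xs, ys):
    """D(m,n) = sup over x of |F_m(x) - G_n(x)| for two samples."""
    xs, ys = sorted(xs), sorted(ys)
    m, n = len(xs), len(ys)
    # bisect_right counts the observations <= v, so count/m is F_m(v).
    return max(abs(bisect_right(xs, v) / m - bisect_right(ys, v) / n)
               for v in xs + ys)


# Small made-up samples, for illustration only.
print(ks_two_sample([1, 3], [2, 4]))                      # interleaved samples
print(ks_two_sample([0.1, 0.2, 0.3], [0.4, 0.5, 0.6, 0.7]))  # separated samples
```

Completely separated samples give the largest possible value, 1, while identical samples give 0.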

The  distribution  theory  associated  with  the 
two-sample  Kolmogorov-Smirnov  statistic  D(m,n) 
has  been  well  documented.  See  Izenman  (1985)  for 
a  summary,  where  it  was  pointed  out  that  most  of 
the  results  are  complicated  algebraically  and  are 
unsuitable  for  computation.  All  too  often  authors 
resort  to  using  asymptotic  approximations  (for 
large  m  and  n)  to  moments,  critical  values,  or 
percentage  points  when,  in  fact,  the  sample  sizes 
are small. Even then, the asymptotic distribution
of D(m,n) is not normal, but involves an infinite
summation whose value is approached most
erratically.

In the first part of this empirical
investigation (see Izenman 1985), the bootstrap
procedure due to Efron (1979, 1982) was applied to
evaluate the standard error of D(m,n) under the
null hypothesis. Two possible bootstrap sampling
procedures were compared for the two-sample
problem:

(a) "separate" bootstrapping, in which a
bootstrap sample was drawn (sampling with replacement)
from the first sample, and then an independent
bootstrap sample was drawn from the second
sample;

(b) "combined" bootstrapping, in which the
two original samples were pooled to form a
combined sample of size m+n, and then two bootstrap
samples were drawn (with replacement) from
the combined sample.

In both cases, two bootstrap samples, one of size
m and the other of size n, were generated and the
statistic D*(m,n) was computed, where

$$D^*(m,n) = \sup_{x \in \mathbb{R}} |F_m^*(x) - G_n^*(x)|,$$

$F_m^*$ and $G_n^*$ being the respective sample distribution
functions of the two bootstrap samples. This
procedure was repeated a large number, B, of times,
yielding B bootstrap replications

$$D^{*1}(m,n),\ D^{*2}(m,n),\ \ldots,\ D^{*B}(m,n).$$

These B values could then be used to estimate
functionals of F and G. It was shown that the
standard deviation of these B values, namely,

$$\Big\{ (B-1)^{-1} \sum_{b=1}^{B} \big(D^{*b} - D^{*\cdot}\big)^2 \Big\}^{1/2},$$

where

$$D^{*\cdot} = B^{-1} \sum_{b=1}^{B} D^{*b},$$

is an excellent estimator of the standard error of
D(m,n) under the null hypothesis when sampling is
carried out using the "combined" bootstrap
procedure. The "separate" bootstrap procedure is
uniformly poorer for estimating those same
standard errors. Simulations were carried out in
each case by sampling with replacement from the
uniform distribution on (0,1) according to the
values of m and n, carrying out the appropriate
bootstrap recipe described above, and repeating
the procedure T (= number of trials) times. The resulting
standard deviations were then averaged over all T
trials and compared with the exact standard error
and an asymptotic approximation. The simulation
parameters chosen were:

n = 25, 50;  m = 5(5)n;  B = 100;  T = 100.
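The two resampling schemes and the resulting standard error estimate can be sketched as follows. This is an illustrative Python version under stated assumptions, not the original Fortran/IMSL implementation: the function names, the seed, and the sample sizes are chosen for the example, and a single trial is shown rather than an average over T trials.

```python
import random
from bisect import bisect_right
from statistics import stdev


def ks_two_sample(xs, ys):
    """Two-sample Kolmogorov-Smirnov distance D(m,n)."""
    xs, ys = sorted(xs), sorted(ys)
    m, n = len(xs), len(ys)
    return max(abs(bisect_right(xs, v) / m - bisect_right(ys, v) / n)
               for v in xs + ys)


def bootstrap_ks(xs, ys, B, rng, combined=True):
    """Return B bootstrap replications of D*(m,n).

    combined=True pools both samples before resampling (scheme b);
    combined=False resamples each sample separately (scheme a).
    """
    m, n = len(xs), len(ys)
    pool = list(xs) + list(ys)
    reps = []
    for _ in range(B):
        if combined:
            xb, yb = rng.choices(pool, k=m), rng.choices(pool, k=n)
        else:
            xb, yb = rng.choices(xs, k=m), rng.choices(ys, k=n)
        reps.append(ks_two_sample(xb, yb))
    return reps


rng = random.Random(0)                      # fixed seed, for repeatability
xs = [rng.random() for _ in range(10)]      # m = 10 uniform(0,1) values
ys = [rng.random() for _ in range(25)]      # n = 25 uniform(0,1) values

reps = bootstrap_ks(xs, ys, B=200, rng=rng, combined=True)
se_hat = stdev(reps)                        # uses the (B-1) divisor
print("bootstrap SE estimate:", round(se_hat, 3))
```

`statistics.stdev` divides by B-1, matching the estimator given above. Repeating the last four lines over T freshly generated pairs of samples and averaging the estimates would mimic the simulation design described in the text.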

As a footnote to part one of this investigation,
we have recomputed the simulations for n = 25 and
m = 5(5)25 using B = 1000 and T = 20. The results
are given in Table 1 for the "combined" bootstrap
procedure only. It appears that accuracy in
estimating standard errors of D(m,n) is improved
by increasing B, the number of bootstrap
replications, and not the number of trials T. In
fact, variability of the bootstrap standard
deviation has sharply decreased by going from
B = 100, T = 100 to B = 1000, T = 20. It is,
therefore, clear that the "combined" bootstrap
can be used with a high degree of confidence to
estimate the standard error of D(m,n) under the
null hypothesis.

In Section 2, we consider the problem of
estimating percentiles of the distribution of
D(m,n) under the null hypothesis using the
"combined" bootstrap procedure. Then, in Section
3, we consider a suggested logarithmic transformation
of D(m,n). Some further directions of
research in this area are discussed in Section 4
with particular reference to situations involving
censored data.

2.  Percentile  estimation.  Estimating 
percentiles  of  a  distribution  is  a  much  harder 
problem  than  estimating  standard  errors.  This  is 
especially  true  when  using  the  bootstrap.  The 
bootstrap  method  assumes  implicitly  that  the  true 
distribution  is  supported  on  the  observed  data 
points. Hence, the number of bootstrap
replications, B, has to be larger to obtain
reasonable accuracy in the tails of the
distribution.

The "percentile method" (Efron 1979, 1982)
is a straightforward procedure for estimating
percentiles (and confidence intervals) from the
results of bootstrap sampling by finding the
appropriate percentiles of the bootstrap
distribution. To be specific, let

$$\mathrm{CDF}(t) = \mathrm{Prob}_*\{D^*(m,n) \le t\}$$

be the cumulative distribution function of the
bootstrap distribution of D*(m,n). For $0 < \alpha < 0.5$,
the $100(1-\alpha)$-th percentile of the distribution
of D(m,n) is estimated by

$$\mathrm{PP}(100(1-\alpha)) = \mathrm{CDF}^{-1}(1-\alpha).$$

As in the previous Section, two bootstrap simulations
were compared with n = 25 and m = 5(5)25:

PM1: B = 100, T = 100

PM2: B = 1000, T = 20.

For the purposes of simulation, CDF(t) was
approximated by $\#\{D^{*b}(m,n) \le t\}/B$ for each trial,
and averaged over all T trials. A plus-or-minus
figure was also calculated using the standard
deviation of a specific percentile estimate over
the T trials. The results are given in Table 2.
"True" values were obtained by linearly interpolating
in the tables of Kim and Jennrich (1970).
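A percentile-method estimate for one trial can be sketched as below. This is an illustration under stated assumptions: the helper names, the seed, and the sample sizes are hypothetical, the "combined" resampling scheme is used, and the inverse of the empirical bootstrap CDF is taken as the smallest replication t with #{D* <= t}/B >= 1 - alpha, which is one of several common conventions.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(xs, ys):
    """Two-sample Kolmogorov-Smirnov distance D(m,n)."""
    xs, ys = sorted(xs), sorted(ys)
    m, n = len(xs), len(ys)
    return max(abs(bisect_right(xs, v) / m - bisect_right(ys, v) / n)
               for v in xs + ys)


def combined_bootstrap(xs, ys, B, rng):
    """B replications of D*(m,n) drawn from the pooled sample."""
    pool = list(xs) + list(ys)
    return [ks_two_sample(rng.choices(pool, k=len(xs)),
                          rng.choices(pool, k=len(ys)))
            for _ in range(B)]


def percentile_point(reps, alpha):
    """Smallest replication t with #{D* <= t}/B >= 1 - alpha."""
    ordered = sorted(reps)
    # round() guards against floating-point noise before taking the ceiling
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return ordered[k - 1]


rng = random.Random(1)
xs = [rng.random() for _ in range(10)]   # m = 10
ys = [rng.random() for _ in range(25)]   # n = 25
reps = combined_bootstrap(xs, ys, B=1000, rng=rng)

pp90, pp95, pp99 = (percentile_point(reps, a) for a in (0.10, 0.05, 0.01))
print("estimated 90, 95, 99 percent points:", pp90, pp95, pp99)
```

Because D*(m,n) takes values on a discrete grid, the estimated percent points are themselves grid values, which is one reason a large B does not remove all of the error in the tails.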

The bootstrap estimates of percentage points
of D(m,n) were found to be slightly on the low
side, as would be expected. In the simulations,
we only considered 90, 95, and 99 percent points,
and these were estimated between 3% and 14%
too low. More centrally located percentiles,
such as the 68th percentile, should be estimated
better; however, such lower percentiles do not
appear in any published set of tables for ready
comparisons. Table 2 shows that even with B = 1000,
the bootstrap percentile method is still quite low
in estimating true percentiles of the distribution
of D(m,n). Even allowing for variability over the
T trials, percentiles were not estimated anywhere
near as well as were standard errors. The
"bias-corrected percentile method" (Efron 1982) did not
appear to improve the estimates significantly and
the results are not given here.

3. Transformation of D(m,n). The normal
distribution is not the standard large-sample
approximation to D(m,n). However, Kim (1969) has
suggested on the basis of empirical simulation
studies that, for large m and n,

$$U(m,n) = \log_e\{D(m,n)/\mathrm{E}_{(F,F)}D(m,n)\}$$

has approximately a normal distribution with mean
-0.0450 and variance 0.0898. The drawback to
using this transformation was that "one has to
have the exact mean of D(m,n), an awesome task
... for n > 50," according to Kim. If we bootstrap
D(m,n) and replace the exact mean by D(m,n)
itself, then we have B bootstrapped versions of
U(m,n), namely,

$$U^{*1}(m,n),\ U^{*2}(m,n),\ \ldots,\ U^{*B}(m,n).$$

The assertion regarding the approximate normality
of U(m,n) can be checked via a normal probability
plot of the $U^{*b}(m,n)$ values.

As an example, we used the stamp thickness
data from Izenman and Sommer (1985); see also
Example 2 in Izenman (1985). The data consist
of two samples of measurements, one on m = 24
stamps watermarked "Papel Sellado" and the other
on n = 289 unwatermarked stamps, both sets part
of the 1872 Hidalgo Issue of Mexico. The
"combined" bootstrap procedure was applied to the
two samples, and the B = 1000 values of $U^{*b}(m,n)$
were obtained. The mean and variance of those
1000 values were -0.054 and 0.110 respectively,
and the normal probability plot exhibited a clear
linear configuration.

Simulation results, not shown here, showed
this transformation to be reasonable.

4. Censored data. So far, discussion in
this paper, and in Izenman (1985), has been
confined to the complete sample situation.
Recently, a number of papers have appeared in the
literature in which the Kolmogorov-Smirnov
statistic is used to compare two survival curves
(or, distribution functions) for right-censored
data. We refer the reader to Barr and Davidson
(1973), Koziol and Byar (1975), Dufour and Maag
(1978), Fleming et al. (1980), Breslow et al. (1984),
and Sandford (1985).

There is some controversy regarding the
suitability of the Kolmogorov-Smirnov statistic
for comparing censored survival data. Certain
authors (such as Fleming et al. 1980) prefer the
Kolmogorov-Smirnov statistic over the logrank
and Gehan-Wilcoxon statistics in such situations.
As Fleming et al. remark, "it has been our frequent
experience that substantial differences between
two survival distributions may be apparent at one
point in time, but fail to exist elsewhere. For
example, certain treatments for coronary heart
disease yield remarkably improved long-term
survival, even though survival immediately
following onset of treatment may be worse than
that obtained with less aggressive alternative
treatments." For detecting these types of
'crossed-hazards departures' from the null
hypothesis, a modified version of the Kolmogorov-Smirnov
statistic is preferred to the logrank or
Gehan-Wilcoxon statistics, which appear to be
insensitive to such departures. An alternative
view is given by Breslow et al. (1984), who
develop a complementary criterion to be used
in conjunction with the logrank procedure.

Comparisons between these different methods
of studying differences between two survival
curves have involved the use of asymptotic theory
and, for small and medium sized samples, Monte
Carlo simulations. Certain of the asymptotic
results lead to normal approximations of the
distributions of the statistics considered. The
references listed above also include real data
studies for purposes of illustration of the
statistics.

It seems, therefore, that bootstrapping
Kolmogorov-Smirnov statistics (or, possible
modifications) can also be applied in the
presence of censored data. In fact, Efron (1981)
has investigated the use of the bootstrap for
the Kaplan-Meier product-limit estimated survival
curve. The bootstrap was used to assess the
standard error of the Kaplan-Meier curve,
functions (such as location estimates) of the
Kaplan-Meier curve, and associated confidence
intervals. Similar questions can be asked of
the Kolmogorov-Smirnov and related statistics,
based on bootstrap considerations. These
questions will be addressed elsewhere.

"COMBINED" BOOTSTRAP SEs


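                             B = 100, T = 100      B = 1000, T = 20
 n    m   true SE    ASE     Exp.     StDev.       Exp.     StDev.

25    5     ...     0.128    0.122    0.010        0.121     ...
     10    0.096    0.096     ...      ...          ...      ...
     15    0.085    0.084     ...      ...         0.084    0.002
     20    0.077    0.078    0.078    0.007        0.077    0.002
     25    0.073    0.074    0.074    0.006        0.074    0.002

(An entry shown as ... is illegible or missing in the scanned source. For
m = 5, 10 and 15 the column placement of the surviving entries is inferred
from their order of appearance.)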

Table  1.  Comparison  of  true  standard  error,  asymptotic  standard 
error  (ASE)  using  Smirnov  approximation,  and  two  bootstrap  estimates 
of  standard  error  using  the  "combined"  procedure.  Computations  were 
carried  out  on  a  CDC  Cyber  170/750  mainframe  computer  using  Fortran 
programs  and  IMSL  calls  to  subroutines  GGUD  and  NKS2. 


 n    m   Method       PP90             PP95             PP99

25    5   "true" PP    0.572            0.631            0.748
          PM1          0.532 ± 0.031    0.583 ± 0.038    0.676 ± 0.057
          PM2          0.528 ± 0.016    0.596 ± 0.018    0.700 ± 0.024

     10   "true" PP    0.439            0.487            0.580
          PM1          0.407 ± 0.017    0.453 ± 0.024    0.525 ± 0.041
          PM2          0.415 ± 0.009    0.464 ± 0.011    0.562 ± 0.016

     15   "true" PP    0.384            0.426            0.510
          PM1          0.362 ± 0.020    0.403 ± 0.026    0.467 ± 0.037
          PM2          0.367 ± 0.010    0.407 ± 0.010    0.487 ± 0.012

     20   "true" PP    0.353            0.392            0.470
          PM1          0.335 ± 0.018    0.369 ± 0.021    0.429 ± 0.037
          PM2          0.337 ± 0.007    0.378 ± 0.006    0.455 ± 0.009

     25   "true" PP    0.349            0.386            0.461
          PM1          0.318 ± 0.020    0.350 ± 0.024    0.406 ± 0.036
          PM2          0.320 ± 0.000    0.358 ± 0.009    0.430 ± 0.022

Table  2.  Comparison  of  true  90,  95,  and  99  percent  points  (PP)  of  the 
Kolmogorov-Smirnov  statistic  with  bootstrap  estimates  using  the  percentile 
method  (PM).  The  "true"  values  were  linearly  interpolated  from  Kim  and 
Jennrich  (1970)  tables.  The  first  percentile  method  (PM1)  used  B  =  100  and 
T  =  100;  the  second  percentile  method  (PM2)  used  B  =  1000  and  T  =  20.  Entries 
shown  are  mean  ±  stdev  over  the  T  trials.  Computations  were  carried  out  as  for 
Table  1. 


CALCULATING  IMPROVED  BOUNDS  AND  APPROXIMATIONS  FOR  MULTIPLE  COMPARISONS 


James  R.  Kenyon,  University  of  Connecticut 


When applying multiple comparison procedures
to a particular problem, the
simultaneous confidence intervals usually
are conservative. This occurs because
procedures either use bounds in the
coverage probability statement, or project
a multivariate region to a multivariate
rectangle (i.e., expand a
confidence region to a confidence rectangle)
(Miller 1981). Even conditional
confidence intervals may not have the
coverage claimed (Meeks and D'Agostino
1983).

Recently, there has been interest in
obtaining improvements in bounds for
multivariate probabilities (Games 1977,
Glaz and Johnson 1984, Miller 1981, and
Worsley 1982). It has been shown that
the "usual," e.g., Bonferroni type
bounds, often are not very useful, particularly
when there are many events, $B_i$,
and the $P(B_i)$ are not "small," or when
there is a strong dependence structure in
the multivariate distribution (Glaz and
Johnson 1984, Miller 1981, Schwager 1984,
and Worsley 1982).

Let  us  consider  three  problems  for 
simultaneous  confidence  intervals  and 
develop  methods  to  calculate  the  coverage 
probability,  or  improved  bounds  for  this 
probability.  To  obtain  the  approximation 
for  the  coverage  probability,  we  will  be 
using  several  bounds  previously  developed 
but  not  evaluated  for  these  problems, 
including  bounds  using  a  conditional 
probability  approach  developed  by  Glaz 
and  Johnson  (1984). 

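These bounds are (where $A_i$ is any
event and $\bar{A}_i$ denotes its complement):

1) 1st Bonferroni bound

$$P\Big(\bigcup_{i=1}^{t} A_i\Big) \le \sum_{i=1}^{t} P(A_i)$$

2) 2nd Bonferroni bound

$$P\Big(\bigcup_{i=1}^{t} A_i\Big) \ge \sum_{i=1}^{t} P(A_i) - \sum_{i<k} P(A_i \cap A_k)$$

3) 1st Worsley bound

$$P\Big(\bigcup_{i=1}^{t} A_i\Big) \le \sum_{i=1}^{t} P(A_i) - \sum_{i=1}^{t-1} P(A_i \cap A_{i+1})$$

4) 2nd Worsley bound

$$P\Big(\bigcup_{i=1}^{t} A_i\Big) \le \sum_{i=1}^{t} P(A_i) - \frac{2}{t} \sum_{i<k} P(A_i \cap A_k)$$

5) Galambos bound

$$P\Big(\bigcup_{i=1}^{t} A_i\Big) \ge \frac{2}{k} \sum_{i=1}^{t} P(A_i) - \frac{2}{k(k-1)} \sum_{i<h} P(A_i \cap A_h)$$

for $k \ge 2$ and optimal

$$k = \Big[\, 2 \sum_{i<h} P(A_i \cap A_h) \Big/ \sum_{i=1}^{t} P(A_i) \Big] + 2,$$

where $[\cdot]$ denotes the integer part.

6) Sidak bound

$$P\Big(\bigcup_{i=1}^{t} A_i\Big) \le 1 - \prod_{i=1}^{t} P(\bar{A}_i)$$

7) Glaz and Johnson bounds and approximations:

$$\text{a)}\quad P\Big(\bigcup_{i=1}^{t} A_i\Big) \le 1 - \prod_{k=1}^{t} P(\bar{A}_k)$$

$$\text{b)}\quad P\Big(\bigcup_{i=1}^{t} A_i\Big) \le 1 - P(\bar{A}_1) \prod_{k=2}^{t} P(\bar{A}_k \mid \bar{A}_{k-1})$$

$$\text{c)}\quad P\Big(\bigcup_{i=1}^{t} A_i\Big) \le 1 - P(\bar{A}_1 \cap \bar{A}_2) \prod_{k=3}^{t} P(\bar{A}_k \mid \bar{A}_{k-1} \cap \bar{A}_{k-2})$$

$$\text{d)}\quad P\Big(\bigcup_{i=1}^{t} A_i\Big) \le 1 - P(\bar{A}_1 \cap \bar{A}_2 \cap \bar{A}_3) \prod_{k=4}^{t} P(\bar{A}_k \mid \bar{A}_{k-1} \cap \bar{A}_{k-2} \cap \bar{A}_{k-3})$$

$$\text{e)}\quad P\Big(\bigcup_{i=1}^{t} A_i\Big) \le 1 - P(\bar{A}_1 \cap \bar{A}_2 \cap \bar{A}_3 \cap \bar{A}_4) \prod_{k=5}^{t} P(\bar{A}_k \mid \bar{A}_{k-1} \cap \bar{A}_{k-2} \cap \bar{A}_{k-3} \cap \bar{A}_{k-4})$$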


Note that Glaz and Johnson have only obtained
their third bound, c), for special cases.
Calculations for d) and e) have not been
obtained previously. Additionally, the
conditional approximations b) - e) of
Glaz and Johnson are not always guaranteed
to be bounds. A sufficient, but not
necessary, condition for these approximations
to be bounds is that the multivariate
distribution be multivariate
totally positive of order 2 ($\mathrm{MTP}_2$). For
the multivariate normal, $\mathrm{MTP}_2$ is equivalent
to all the partial correlations
being $\ge 0$. Note that in the calculation
of the bounds, one does not need a
t-dimensional multivariate normal or multivariate
t, but only 2 dimensions in most cases
and 5 dimensions in the worst case.
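The bounds that use only single-event and pairwise probabilities (the two Bonferroni bounds, the two Worsley bounds, and the Galambos bound) can be compared with the exact union probability on a toy example. The sketch below is an illustration under stated assumptions: the sample space of equally likely outcomes, the three events, and the function names are hypothetical, and the coefficient 2/t in the second Worsley bound and the integer-part form of the optimal k follow the usual statements of those inequalities. The Sidak and conditional bounds are left out because they need a dependence condition such as MTP_2.

```python
from itertools import combinations
from math import floor


def prob(event, n_outcomes):
    """Probability of an event under equally likely outcomes."""
    return len(event) / n_outcomes


def union_bounds(events, n_outcomes):
    """Exact P(union) and bounds built from first- and second-order terms."""
    t = len(events)
    s1 = sum(prob(a, n_outcomes) for a in events)
    s2 = sum(prob(a & b, n_outcomes) for a, b in combinations(events, 2))
    adjacent = sum(prob(events[i] & events[i + 1], n_outcomes)
                   for i in range(t - 1))
    k = floor(2.0 * s2 / s1) + 2          # optimal k for the Galambos bound
    return {
        "exact": prob(set().union(*events), n_outcomes),
        "bonferroni_upper": s1,
        "bonferroni_lower": s1 - s2,
        "worsley_1_upper": s1 - adjacent,
        "worsley_2_upper": s1 - 2.0 * s2 / t,
        "galambos_lower": 2.0 * s1 / k - 2.0 * s2 / (k * (k - 1)),
    }


# Hypothetical events on 20 equally likely outcomes; neighbours overlap.
events = [set(range(0, 8)), set(range(4, 12)), set(range(8, 16))]
bounds = union_bounds(events, n_outcomes=20)
for name, value in bounds.items():
    print(f"{name:17s} {value:.4f}")
```

In this example only neighbouring events overlap, so the first Worsley bound, which subtracts the adjacent intersections, reproduces the exact probability, while the first Bonferroni bound exceeds 1.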

First, consider a simple control problem
for a one-way analysis of variance
with a balanced design. Let $Y_{ik}$ be distributed
as independent normal random
variables with mean $\mu_i$ and variance $\sigma^2$
for i = 0,1,2,...,t and k = 1,2,...,n.
$Y_{0k}$, k = 1,2,...,n, is the control group
and the remaining t groups, each of n
observations, are the treatment groups.
Thus, the group means $\bar{Y}_i$ are distributed as
independent normals with mean $\mu_i$ and
variance $\sigma^2/n$ for i = 0,1,2,...,t. In
this problem, one is usually interested
in comparing the t treatments to the
control. The resulting contrasts are

$$|\bar{Y}_0 - \bar{Y}_i| \big/ \big(S\sqrt{2/n}\big) < c \quad \text{for } i = 1,2,\ldots,t,$$

where $S^2$ is the usual pooled estimate of
$\sigma^2$.

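The coverage probability is

$$1 - \alpha = P\left( \frac{|\bar{Y}_0 - \bar{Y}_i| / \sigma}{S\sqrt{2/n} / \sigma} < c;\ i = 1,2,\ldots,t \right).$$

Without loss of generality, we can assume
that $\sigma^2 = 1$ and $\mu_i = 0$ for i = 0,1,2,...,t.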

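$$\begin{aligned}
1 - \alpha &= \int_0^{\infty} P\big(|\bar{Y}_0 - \bar{Y}_i| < cS\sqrt{2/n};\ i = 1,2,\ldots,t \mid S = s\big)\, f_S(s)\, ds \\
&= \int_0^{\infty}\!\!\int_{-\infty}^{\infty} P\big(a < \bar{Y}_i < b;\ i = 1,\ldots,t \mid S = s,\ \bar{Y}_0 = y\big)\, f_{\bar{Y}_0}(y)\, f_S(s)\, dy\, ds \\
&= \int_0^{\infty}\!\!\int_{-\infty}^{\infty} \big[P(a < \bar{Y}_1 < b \mid S = s,\ \bar{Y}_0 = y)\big]^t\, f_{\bar{Y}_0}(y)\, f_S(s)\, dy\, ds
\end{aligned}$$

where $a = \bar{Y}_0 - cS\sqrt{2/n}$ and $b = \bar{Y}_0 + cS\sqrt{2/n}$.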

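Since $\bar{Y}_i$, i = 0,1,2,...,t, are
independent and identically distributed,
and each is independent of S,
$f_{\bar{Y}_0}(y) = \sqrt{n/2\pi}\; e^{-ny^2/2}$. Let $\nu$ = d.f. = (t+1)(n-1);
then $\nu S^2/\sigma^2$ is distributed
Chi-Squared($\nu$). From this, the density
for S is

$$f_S(s) = \frac{\nu^{\nu/2}\, s^{\nu-1}\, e^{-\nu s^2/2}}{2^{\nu/2-1}\, \Gamma(\tfrac{1}{2}\nu)}.$$

Now let $z = y\sqrt{n/2}$ and $w = \tfrac{1}{2}\nu s^2$.
These transformations yield

$$1 - \alpha = \frac{1}{\sqrt{\pi}} \int_0^{\infty}\!\!\int_{-\infty}^{\infty} \frac{w^{\frac{1}{2}\nu - 1}}{\Gamma(\tfrac{1}{2}\nu)} \big[\Phi_{0,1/n}(b) - \Phi_{0,1/n}(a)\big]^t\, e^{-z^2} e^{-w}\, dz\, dw$$

where $\Phi_{\mu,\sigma^2}(x)$ is the c.d.f. at x for a
normal distribution with mean $\mu$ and
variance $\sigma^2$, $a = (z - c\sqrt{2w/\nu}) \times \sqrt{2/n}$ and
$b = (z + c\sqrt{2w/\nu}) \times \sqrt{2/n}$.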

Now, recall the Gauss-Laguerre
formulas:

$$\int_0^{\infty} e^{-x} f(x)\, dx \approx \sum_{i=1}^{n} A_i f(x_i)$$

and the Gauss-Hermite formulas:

$$\int_{-\infty}^{\infty} e^{-x^2} f(x)\, dx \approx \sum_{i=1}^{n} B_i \big[f(x_i) + f(-x_i)\big] = \sum_{i=1}^{n} 2B_i f(x_i)$$

if f(x) is symmetric about 0.
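The form of the two quadrature rules can be illustrated with the lowest-order cases, whose nodes and weights have closed forms. This is a sketch for illustration only, not the rules used in the paper (which would need many more points): the two-point Gauss-Laguerre rule has nodes 2 -/+ sqrt(2) with weights (2 +/- sqrt(2))/4, and the two-point Gauss-Hermite rule has the symmetric pair of nodes +/- 1/sqrt(2) with weight sqrt(pi)/2 each. Both are exact for polynomials of degree up to 3. The constant and function names are assumptions of this example.

```python
import math

# Two-point Gauss-Laguerre rule: (node x_i, weight A_i).
LAGUERRE_2 = [(2.0 - math.sqrt(2.0), (2.0 + math.sqrt(2.0)) / 4.0),
              (2.0 + math.sqrt(2.0), (2.0 - math.sqrt(2.0)) / 4.0)]

# Two-point Gauss-Hermite rule, stored as one positive node x_1 with
# weight B_1; the node -x_1 carries the same weight.
HERMITE_PAIRS = [(1.0 / math.sqrt(2.0), math.sqrt(math.pi) / 2.0)]


def gauss_laguerre(f, rule=LAGUERRE_2):
    """Approximate the integral of exp(-x) f(x) over (0, inf)."""
    return sum(a_i * f(x_i) for x_i, a_i in rule)


def gauss_hermite(f, rule=HERMITE_PAIRS):
    """Approximate the integral of exp(-x^2) f(x) over (-inf, inf)."""
    return sum(b_i * (f(x_i) + f(-x_i)) for x_i, b_i in rule)


def gauss_hermite_symmetric(f, rule=HERMITE_PAIRS):
    """Same rule when f is symmetric about 0: each pair gives 2 B_i f(x_i)."""
    return sum(2.0 * b_i * f(x_i) for x_i, b_i in rule)


# Integral of exp(-x) x^3 over (0, inf) is 3! = 6.
print(gauss_laguerre(lambda x: x ** 3))
# Integral of exp(-x^2) x^2 over the real line is sqrt(pi)/2.
print(gauss_hermite(lambda x: x ** 2), math.sqrt(math.pi) / 2.0)
# A non-polynomial symmetric integrand is only approximated.
print(gauss_hermite_symmetric(math.cos))
```

For smooth non-polynomial integrands, such as the normal c.d.f. differences in the coverage formula, accuracy depends on using enough points in each rule.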

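Thus,

$$1 - \alpha \approx \frac{1}{\sqrt{\pi}} \sum_{i=1}^{n_w} A_i\, \frac{w_i^{\frac{1}{2}\nu - 1}}{\Gamma(\tfrac{1}{2}\nu)} \sum_{k=1}^{n_z} 2B_k \big[\Phi_{0,1/n}(b) - \Phi_{0,1/n}(a)\big]^t$$

where $n_w$ is the number of points used
in the Gauss-Laguerre approximation, $n_z$
is the number of points used in the Gauss-Hermite
approximation, $a = (z_k - c\sqrt{2w_i/\nu}) \times \sqrt{2/n}$,
$b = (z_k + c\sqrt{2w_i/\nu}) \times \sqrt{2/n}$ and
$\Phi_{\mu,\sigma^2}(x)$ is as above.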

Since the variables used for conditioning have infinite limits of integration and the weight function appears in the density, the Gaussian methods have a natural advantage. Other methods require either truncating the integral and then approximating the truncated integral, or transforming the variable of integration to obtain finite limits of integration. In addition, Stroud and Secrest give a comparison of Gaussian quadrature with other methods. For an equal number of points the Gaussian quadrature error is comparable, even in cases where it is not believed to be the "best." They also compare different approaches to calculating some specific integrals with infinite limits, including transformations to integrals with finite limits. Moreover, Gaussian quadrature absorbs part of the densities into its weight function, and other terms cancel in the transformation to this form. All these results appear to favor this approach for these particular densities.
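The balanced-case quadrature can be sketched in a few lines. This is a minimal illustration, not the authors' program: it assumes NumPy's `laggauss` and `hermgauss` for nodes and weights, uses the normalizing constant $1/\sqrt{\pi}$ that follows from the substitutions $z = y\sqrt{n/2}$ and $w = \tfrac{1}{2}\nu s^2$, and the function names `coverage` and `solve_c` and the default of 40 points per rule are hypothetical choices. The bisection routine illustrates the inverse problem of finding c for a given confidence level.

```python
import math
import numpy as np
from numpy.polynomial.laguerre import laggauss
from numpy.polynomial.hermite import hermgauss


def normal_cdf(x):
    """Standard normal c.d.f., applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(x) / math.sqrt(2.0)))


def coverage(c, t, n, n_w=40, n_z=40):
    """Joint coverage for t treatments vs. a control, n units per group."""
    nu = (t + 1) * (n - 1)
    w, A = laggauss(n_w)   # weight exp(-w) on (0, inf)
    z, B = hermgauss(n_z)  # weight exp(-z**2); nodes of both signs
    total = 0.0
    for w_i, A_i in zip(w, A):
        half = c * math.sqrt(2.0 * w_i / nu)
        a = (z - half) * math.sqrt(2.0 / n)
        b = (z + half) * math.sqrt(2.0 / n)
        # Phi_{0,1/n}(x) is the standard normal c.d.f. evaluated at x * sqrt(n).
        diff = normal_cdf(b * math.sqrt(n)) - normal_cdf(a * math.sqrt(n))
        inner = np.sum(B * diff ** t)
        # w**(nu/2 - 1) / Gamma(nu/2), computed on the log scale for stability.
        gamma_term = math.exp((0.5 * nu - 1.0) * math.log(w_i) - math.lgamma(0.5 * nu))
        total += A_i * gamma_term * inner
    return float(total / math.sqrt(math.pi))


def solve_c(level, t, n, lo=0.0, hi=20.0, tol=1e-6):
    """Bisection for the c whose coverage equals `level` (coverage increases in c)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(mid, t, n) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `coverage(2.86, 2, 3)` evaluates the case t = 2, n = 3 (6 degrees of freedom), and `solve_c(0.95, 2, 3)` searches for the corresponding critical value.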

Notice that the above formula is for the exact coverage probability, and the same techniques can be applied to calculating probabilities for any number of events in this problem. Consequently, we can easily compute and compare the bounds with the exact probability. Also, the multivariate density in the control problem is MTP$_2$, thus guaranteeing the conditional bounds, but there is not much dependence structure for the conditional probability bounds to take advantage of. See Table 1 for a comparison of the bounds and exact probability for some particular cases that have previously been presented in the literature.

Secondly, let us consider the same problem, but for unbalanced data. That is, n is not necessarily the same for each group. Thus, we replace n above with $n_i$. We also allow c to vary for each group and replace it with $c_i$. The coverage probability $1 - \alpha$ is:

$$P\big(|\bar{Y}_0 - \bar{Y}_i| / (S\sqrt{1/n_0 + 1/n_i}) < c_i;\ i = 1,2,\ldots,t\big)
= \frac{1}{\sqrt{\pi}} \int_0^\infty \int_{-\infty}^{\infty} \frac{w^{\frac{1}{2}\nu - 1}}{\Gamma(\tfrac{1}{2}\nu)} \prod_{i=1}^{t} \big[\Phi_{0,1/n_i}(b) - \Phi_{0,1/n_i}(a)\big]\, e^{-z^2} e^{-w}\, dz\, dw,$$

where, in the above, $b = (z + c_i\sqrt{2w/\nu}) \times \sqrt{1/n_0 + 1/n_i}$, $a = (z - c_i\sqrt{2w/\nu}) \times \sqrt{1/n_0 + 1/n_i}$, and $\Phi_{\mu,\sigma^2}(x)$ is the same as earlier. Thus

$$1 - \alpha \approx \frac{1}{\sqrt{\pi}} \sum_{m=1}^{n_w} A_m\, \frac{w_m^{\frac{1}{2}\nu - 1}}{\Gamma(\tfrac{1}{2}\nu)} \sum_{k=1}^{n_z} 2B_k \prod_{i=1}^{t} \big[\Phi_{0,1/n_i}(b) - \Phi_{0,1/n_i}(a)\big],$$

where

$a = (z_k - c_i\sqrt{2w_m/\nu}) \times \sqrt{1/n_0 + 1/n_i}$,

and

$b = (z_k + c_i\sqrt{2w_m/\nu}) \times \sqrt{1/n_0 + 1/n_i}$.

There is more computing to be done here, but for many applications the exact probability can usually be calculated along with the bounds. This is particularly interesting for the control problem since many practitioners often allocate more units to the control group. Dunnett (1964) has given tables and suggests that the optimal allocation $n_0/n_i = \sqrt{t}$ holds when the joint confidence level is $\ge 0.95$. Refer to Table 2 for a comparison of the bounds and exact probabilities for several cases previously presented in the literature.

For the last problem, let the random variable Y have a normal distribution with a mean of $X\beta$ and a variance of $\sigma^2 I$, where X is a known design matrix and $\beta$ is a vector of unknown parameters. For t arbitrary contrasts, $k_i\beta$, which must be estimable, $P(|k_i\hat{\beta}|/s(k_i\hat{\beta}) < c_i$ for $i = 1,2,\ldots,t)$ is the multivariate probability of interest, where $s(k_i\hat{\beta})$ is the sample standard deviation of $k_i\hat{\beta}$ and $\hat{\beta}$ is an estimate of $\beta$.

For arbitrary contrasts only conditioning on S readily applies. This gives

$$P\big(|k_i\hat{\beta}| / (S\sqrt{k_i(X'X)^{-}k_i'}) < c_i;\ i = 1,2,\ldots,t\big)
= \int_0^\infty f_S(s)\, P\big(|k_i\hat{\beta}| < c_i s\sqrt{k_i(X'X)^{-}k_i'};\ i = 1,2,\ldots,t \mid S = s\big)\, ds,$$

where

$$P\big(|k_i\hat{\beta}| < c_i s\sqrt{k_i(X'X)^{-}k_i'};\ i = 1,2,\ldots,t \mid S = s\big)$$

is a multivariate normal probability.

This will require much more computational work since in general there is no dependence structure that can readily be taken advantage of, as was done in the earlier cases, to obtain conditional independence. Additionally, there is no guarantee that $(k(X'X)^{-}k')$ has nonnegative off-diagonal elements for the MTP$_2$ property.

Lastly, Gauss-Legendre formulas will be used to calculate the probabilities necessary for the bounds under study.

For computational accuracy and efficiency, matrix decompositions and numerical methods for linear algebra must be utilized.

By taking advantage of any dependence structure and the form of the density, improved and even exact coverage probabilities can be calculated for multiple comparison problems or more generally for simultaneous confidence rectangles. With these capabilities, the inverse problem of determining c given $\alpha$, n, and t can be performed as was done for Table 1.
TABLE 1a

Games (1977) used the Sidak product bound to produce tables for contrasts. The following values for c were selected from this table for the degrees of freedom for error, number of contrasts, and alpha level given. Gamma 1 through 5 are the Glaz and Johnson bounds.

t                        2          5          9          9
n                        2          5          4          4
d.f.                     3         29         30         30
confidence level         0.80       0.99       0.90       0.80
c                        2.294      3.965      2.687      2.369

Lower Bounds on the Confidence Level
1st Bonferroni           0.78886    0.98973    0.89521    0.77976
Sidak (Gamma 1)          0.80001    0.98973    0.89996    0.80013
1st Worsley              0.83165    0.99046    0.90729    0.81441
2nd Worsley              0.83165    0.99046    0.90729    0.81441
Gamma 2                  0.83165    0.99048    0.91014    0.82302
Gamma 3                  --         0.99088    0.91648    0.83680
Gamma 4                  --         0.99109    0.92063    0.84561
Gamma 5                  --         0.99118    0.92341    0.85138

Upper Bounds on the Confidence Level
2nd Bonferroni           0.83165    0.99155    0.94956    0.92238
Galambos                 0.83165    0.99155    0.94826    0.90070

Exact level              0.83165    0.99118    0.92735    0.85937
Correct c                2.1365     3.4027     2.5391     2.1867

TABLE 1b

Dunnett (1955, 1964) presented tables for the first problem presented here. The following values for c were taken from the 1964 article and thus should be exact.

t                        2          5          5          9
n                        3          3          5          4
d.f.                     6         12         24         30
confidence level         0.95       0.95       0.99       0.95
c                        2.86       2.90       3.40       2.86

Lower Bounds on the Confidence Level
1st Bonferroni           0.94240    0.93337    0.98821    0.93124
Sidak (Gamma 1)          0.94323    0.93512    0.98827    0.93331
1st Worsley              0.94982    0.94253    0.98908    0.93828
2nd Worsley              0.94982    0.94253    0.98908    0.93828
Gamma 2                  0.94982    0.94326    0.98910    0.93955
Gamma 3                  --         0.94724    0.98957    0.94351
Gamma 4                  --         0.94915    0.98982    0.94615
Gamma 5                  --         0.94989    0.98992    0.94793

Upper Bounds on the Confidence Level
2nd Bonferroni           0.94982    0.95627    0.99037    0.96290
Galambos                 0.94982    0.95627    0.99037    0.96290

Exact level              0.94982    0.94989    0.98992    0.95050

TABLE 2

As given in Table 1b, Dunnett (1955, 1964) presented tables for the first and second problem presented here. The following values for c were taken from the 1964 article and were corrected for unequal sample size as instructed in the article. These sample sizes were selected to provide allocations as given in Dunnett (i.e., $n_0/n_i \approx \sqrt{t}$) and thus should be exact for these cases even though the method described here is more general.

t                        5          5          9          9
n_0, n_i                 4, 2       10, 4      6, 2       6, 2
d.f.                     8          24         14         14
confidence level         0.95       0.99       0.95       0.99
c                        3.2000425  3.43876    3.1764266  4.0217

Lower Bounds on the Confidence Level
1st Bonferroni           0.93609    0.98307    0.93908    0.98879
Sidak (Gamma 1)          0.93765    0.98935    0.94070    0.98885
1st Worsley              0.94235    0.98958    0.94219    0.98917
2nd Worsley              0.94235    0.98958    0.94219    0.98917
Gamma 2                  0.94311    0.98961    0.94332    0.98921
Gamma 3                  0.94616    0.98978    0.94529    0.98949
Gamma 4                  0.94776    0.98988    0.94675    0.98969
Gamma 5                  0.94841    0.98993    0.94783    0.98985

Upper Bounds on the Confidence Level
2nd Bonferroni           0.95182    0.99000    0.95308    0.99050
Galambos                 0.95182    0.99000    0.95308    0.99050

Exact level              0.94841    0.98993    0.94960    0.99011
c (from Games)           3.342      3.465      3.261      4.084
ABSTRACT

Diagnostic  measures  of  the  joint  influence 
of  subsets  of  data  points  in  regression  are 
easy  to  define  but  have  lacked  an  intuitive 
interpretation.  Measures  of  influence  for  a 
single  case  are  easy  to  interpret  in  terms  of 
the  position  of  the  observation  in  the  spaces 
spanned  by  the  columns  of  the  X  matrix  and  the 
(orthogonal) residual column $e = Y - \hat{Y}$. We present
an  orthogonal  decomposition  of  the  joint 
influence  measures  in  terms  of  the  single  case 
measures  of  an  equivalent  set  of  orthogonal 
pseudopoints.  The  decomposition  allows  an 
intuitive  interpretation  of  the  joint  influence 
measures  and  leads  to  the  derivation  of  the 
distribution  of  the  measures  under  the  usual 
assumptions.  We  illustrate  the  use  of  the 
decomposition  in  characterizing  data 
configurations  which  contain  influential  points 
and  in  analysis  of  data. 

1.  Introduction  with  Example. 

In  recent  years,  methods  to  diagnose 
influential  observations  in  regression  analysis 
have  received  much  attention  in  the  Statistics 
literature.  In  data  analysis  we  often 
encounter  subsets  of  data  points  which  are 
jointly  influential,  although  not  individually 
so.  That  is,  simultaneous  perturbation  or 
deletion  of  all  the  cases  in  the  subset  leads 
to  substantial  changes  in  the  estimated 
regression  coefficients,  although  perturbation 
or  deletion  of  single  cases  from  the  subset 
leads  only  to  small  changes  in  the  results. 

This can be easily illustrated with the adaptive scores data, initially reported by Mickey, Dunn, and Clark (1967), and more recently analyzed in Cook and Weisberg (1980). These data contain at least three unusual points: cases 18, 2, and 19. Cases 2 and 18 are a highly influential pair, as evidenced by the change in the estimated regression line due to deleting these two cases (see Figure 1).

Several diagnostic measures of joint influence have been developed (Cook and Weisberg 1982; Belsley, Kuh, and Welsch 1980) as generalizations of single-point diagnostics. The usual presentations of these measures, however, lack an intuitive interpretation that leads to an understanding of what kinds of arrangements of points are highly influential.

The  joint  influence  of  cases  2  and  18  in 
this  data  set  is  further  illustrated  by  the 
individual  and  joint  statistics  for  these  two 
observations,  summarized  in  Tables  1  and  2 
below. From Table 1 we notice that, although the value of Cook's distance is highest for case 18, it is well below the "flag" value of 1.0. These cases are likely to go unnoticed on the basis of the individual case statistics.


[Figure 1: plot not reproduced; vertical axis: Adaptive Score.]

Table 1

Case Statistics for Observations 2 and 18, Adaptive Scores Data.
Full Data Set.

Case Number    h_ii     e_i       d_i       Cook's D_i
2              0.15     -9.57     -11.26    0.08
18             0.65     -5.54     -15.33    0.68

Table 2

Case Statistics for Observations 2 and 18, Adaptive Scores Data.
Reduced Data Sets.

Case                       h_ii     e_i           d_i           Cook's D_i
Case 2 (delete case 18)    0.47     -14.71        -27.78        1.02
Case 18 (delete case 2)    0.76     [illegible]   [illegible]   [illegible]


The value of Cook's distance for each of these cases when the other one is deleted (Table 2) is much higher, further evidencing that these cases reinforce each other in their effect on the estimated regression line. There is no simple way to combine the individual case statistics to obtain the joint value of Cook's distance, D = 6.37.

We present a decomposition of several joint influence measures as the sum of the case influence measures of a set of orthogonal pseudopoints. The pseudopoints are equivalent to the points under investigation in the OLS sense. That is, if the observations under investigation are replaced with the pseudopoints the results of the analysis remain the same. We continue the discussion of the adaptive scores data by giving the pseudopoints equivalent to cases 2 and 18 (labeled $Z_1^\circ$ and $Z_2^\circ$), and the value of their case influence measures, in Table 3 below.


Table 3

Individual Case Statistics for Observations 2 and 18 and Equivalent Points

Case     wt.          X_1      X_2      Y        h_ii     e_i       d_i       D_i
18       1            1        42       57       0.65     -5.54     -15.32    0.079
2        1            1        26       71       0.15     -9.57     -11.26    0.67
Z_1°     1            1.331    49.08    81.85    0.794    -9.09     -44.12    6.365
Z_2°     1            0.478     5.58    39.89    0.012    -6.34      -6.36    0.002
Q_1      (1.331)^2    1        36.87    61.49    0.794    -6.83     -32.97    6.365
Q_2      (0.478)^2    1        11.67    83.45    0.012    -13.26    -13.54    0.002

The most important feature of Table 3 is that the values of Cook's distance for the pseudopoints, denoted $Z_1^\circ$ and $Z_2^\circ$, add up correctly to 6.37. Also note that the large value of $D_i$ for the pseudopoint $Z_1^\circ$ implies that this point would be identified by common, single row diagnostic techniques. Since the $X_1$ column entries for the pseudopoints are not 1, these two points cannot be plotted on the same axes as the remaining observations. We have therefore appended two more rows to Table 3, labeled $Q_1$ and $Q_2$, also equivalent to the pseudopoints and to cases 2 and 18. The equivalence is in the sense that, if $Q_1$ and $Q_2$ replace cases 2 and 18 in the original data set and WLS is applied (with weights as indicated for $Q_1$ and $Q_2$, and weight 1 for the remaining cases), the same estimates are obtained. This algebraic identity, as well as the individual case statistics in Table 3, were verified by computation. The interpretive value of this approach is illustrated in Figure 2. This figure is a plot of the data with cases 2 and 18 replaced by $Q_1$ and $Q_2$. We can clearly see in this picture that $Q_1$, with weight $(1.331)^2 \approx 1.8$, would control the fit.


Subsets  of  data  points  which  are  jointly 
influential  will  usually  be  equivalent  to  a  set 
of  pseudopoints  of  which  at  least  one  is 
individually  influential. 

[Figure 2. Scatterplot of Adaptive Scores Data, with the $Q_1$ and $Q_2$ pseudopoints replacing cases 2 and 18. Vertical axis: Adaptive Score.]


2.  Notation. 

We define an augmented data matrix, $V^*$, of dimensionality $(n+k) \times p$. The first k rows are assigned to the points whose joint influence is being investigated. We refer to these points as keypoints, and denote them by Z. The remaining data points, not under investigation, are the n rows of the reduced data matrix, V. In a data analysis situation we would permute the rows of the data to the appropriate position.

$$V^* = [X^* \mid Y^*] = \begin{bmatrix} Z \\ V \end{bmatrix} = \begin{bmatrix} X_z & Y_z \\ X & Y \end{bmatrix} \qquad (1)$$


We let $\hat{\beta}^*$ denote the regression estimator operating on the full data matrix, and $\hat{\beta}$ denote the regression estimator operating on the data points in V. Therefore, using the notation introduced above, we have

$$\hat{\beta}^* = (X^{*\prime}X^*)^{-1}(X^{*\prime}Y^*) \qquad (2)$$

$$\hat{\beta} = (X'X)^{-1}(X'Y) \qquad (3)$$

The vector of residuals from OLS, $e^*_{n+k}$, on the augmented data matrix, $V^*$, is

$$e^*_{n+k} = (e_k^{*\prime},\ e_n^{*\prime})' = (Y^* - X^*\hat{\beta}^*) \qquad (4)$$

We define the vector of predicted residuals, $d_{n+k}$, as residuals from the fit based on V,

$$d_{n+k} = (d_k',\ d_n')' = (Y^* - X^*\hat{\beta}) \qquad (5)$$



We let $s^{*2}$ denote the estimated residual variance from OLS on the augmented data matrix, $V^*$, and $s^2$ denote the estimated residual variance from OLS on V. That is,

$$s^{*2} = \frac{e_{k+n}^{*\prime}\, e^*_{k+n}}{(k+n-p)} \qquad (6)$$


Two frequently used matrices are:

$$G = X^*(X'X)^{-1}X^{*\prime} \qquad (8)$$

$$G_z = X_z(X'X)^{-1}X_z' = T\Theta T' \qquad (9)$$

The matrix G is an augmented hat matrix. Partitioning it as

$$G = \begin{bmatrix} G_z & G_{zv} \\ G_{vz} & G_v \end{bmatrix} \qquad (10)$$

(Note that $G_z$ = [remainder illegible in source])

shows that $G_v = X(X'X)^{-1}X'$ is the hat matrix for the data points in V. The elements of $G_{zv}$ and $G_v$, $g_{ij}$ for i = 1 to n+k and j = 1 to n, are interpretable as the rate of change in the $i$th OLS fitted value based on V, $\hat{y}_i = x_i\hat{\beta}$, with respect to $y_j$. The diagonal elements of $G_z$ provide a measure of the distance, in X-space, between the keypoints and the centroid of the remaining observations relative to a scale and orientation determined by the remaining observations.

The reader can verify that

$$d_k = (I + G_z)\, e_k^* \qquad (11)$$

We stress that $T$, $\Theta$ and $\theta$ are functions of the keypoints.
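Identity (11) can be checked numerically. The sketch below is illustrative only: it uses simulated data (not the adaptive scores data), takes the first k rows as keypoints as in the notation above, and every variable name is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3  # n reduced-data rows, k keypoints

# Simulated augmented data: intercept plus one regressor (so p = 2).
X_star = np.column_stack([np.ones(n + k), rng.normal(size=n + k)])
Y_star = X_star @ np.array([1.0, 2.0]) + rng.normal(size=n + k)

# The first k rows are the keypoints; the rest form the reduced data.
X_z, Y_z = X_star[:k], Y_star[:k]
X, Y = X_star[k:], Y_star[k:]

beta_star = np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)  # full-data fit
beta = np.linalg.solve(X.T @ X, X.T @ Y)                           # reduced-data fit

e_star_k = Y_z - X_z @ beta_star              # OLS residuals of the keypoints
d_k = Y_z - X_z @ beta                        # predicted residuals of the keypoints
G_z = X_z @ np.linalg.solve(X.T @ X, X_z.T)   # X_z (X'X)^{-1} X_z'

identity_holds = bool(np.allclose(d_k, (np.eye(k) + G_z) @ e_star_k))
```

The check relies on the matrix identity $(I - X_z(X^{*\prime}X^*)^{-1}X_z')^{-1} = I + G_z$.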

3.  A  Brief  Review  of  Joint  Influence 
Diagnostics. 

Three approaches have been used in the development of multiple row diagnostics:

a. differencing of $\hat{\beta}$ with respect to the presence or absence of the keypoints (traditionally labeled "case-deletion"),

b. differentiation of $\hat{\beta}$ with respect to weights assigned to the keypoints (traditionally labeled "differentiation"),

c. ratio of data space volumes calculated in the presence and absence of the keypoints (traditionally labeled "geometric interpretation").

The differencing and ratio approaches are case deletion methods; the differentiation approach is a continuous form of case deletion.

After a healthy dose of algebra we have obtained general forms for measures based on each of these approaches in terms of the predicted residuals for the keypoints and the eigenvalue decomposition of $G_z$. These forms, as well as examples of well-known diagnostics of each type, are summarized in Table 4.


4.  Orthogonal  Pseudopoints 

The consistent appearance of $Td_k$ in the expressions in Table 4 suggested a data transformation as indicated below:

$$\begin{bmatrix} T & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} Z \\ V \end{bmatrix} = \begin{bmatrix} Z^\circ \\ V \end{bmatrix} \qquad (12)$$

This transformation leaves the reduced data matrix, V, unchanged and replaces the keypoints, Z, with a set of pseudopoints, $Z^\circ = TZ$. The pseudopoints, $Z^\circ$, are equivalent to the keypoints, Z, in the OLS sense. That is, the transformation does not change the results of the OLS analysis.
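The OLS equivalence of keypoints and pseudopoints can be illustrated on simulated data. In the sketch below, T is assumed to be the transpose of the eigenvector matrix returned by NumPy's `eigh` applied to $G_z$, so that $TG_zT'$ is diagonal; the data and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 25, 3

X_star = np.column_stack([np.ones(n + k), rng.normal(size=n + k)])
Y_star = X_star @ np.array([0.5, -1.0]) + rng.normal(size=n + k)
X_z, Y_z = X_star[:k], Y_star[:k]   # keypoints
X, Y = X_star[k:], Y_star[k:]       # reduced data

G_z = X_z @ np.linalg.solve(X.T @ X, X_z.T)
theta, U = np.linalg.eigh(G_z)      # G_z = U diag(theta) U'
T = U.T                             # assumed orientation of T

# Pseudopoints: rotate the keypoint rows of both X and Y.
Zx, Zy = T @ X_z, T @ Y_z
X_pseudo = np.vstack([Zx, X])
Y_pseudo = np.concatenate([Zy, Y])

beta_star = np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)
beta_pseudo = np.linalg.solve(X_pseudo.T @ X_pseudo, X_pseudo.T @ Y_pseudo)

same_fit = bool(np.allclose(beta_star, beta_pseudo))
G_z_pseudo = Zx @ np.linalg.solve(X.T @ X, Zx.T)
is_diagonal = bool(np.allclose(G_z_pseudo, np.diag(theta)))
```

Because T is orthogonal, $X_z'T'TX_z = X_z'X_z$ and $X_z'T'TY_z = X_z'Y_z$, so the normal equations, and hence the estimates, are unchanged.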

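Table 4

Selected Joint Influence Diagnostic Measures

Differencing:

$$C_k = (\hat{\beta}^* - \hat{\beta})'(X^{*\prime}X^*)(\hat{\beta}^* - \hat{\beta}) = d_k'T'\Theta(I+\Theta)^{-1}Td_k \qquad (13)$$

    Example: $D_k = C_k/(p\,s^{*2})$

$$(\hat{\beta}^* - \hat{\beta})'(X'X)(\hat{\beta}^* - \hat{\beta}) = d_k'T'\Theta(I+\Theta)^{-2}Td_k \qquad (14)$$

    Example: MDFFIT (Belsley, Kuh, and Welsch)

Differentiation:

    [expression illegible in source]   (15)

Ratio of Data Space Volumes:

$$\frac{\det(V^{*\prime}V^*)}{\det(V'V)} \qquad (16)$$

    Example: Andrews and Pregibon statistic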


The pseudopoints, $Z^\circ$, are orthogonal with respect to both the $(X'X)^{-1}$ and $(X^{*\prime}X^*)^{-1}$ inner products. Also,

$$G_z^\circ = TG_zT' = \Theta \qquad (17)$$

$$e_k^{*\circ} = Te_k^* \qquad (18)$$

$$d_k^\circ = Td_k \qquad (19)$$

Substitution of (17) to (19) into the equations in Table 4 leads to further re-expression of the joint influence diagnostics for the keypoints as weighted sums of the case influence diagnostics for the pseudopoints. These re-expressions are given in Table 5.


Table 5

Joint Influence Measures as Functions of the Orthogonal Pseudopoints

$$\Big(\frac{1}{k \cdot s^2}\Big) C_k = \Big(\frac{1}{k \cdot s^2}\Big) \sum \theta_i (1+\theta_i)^{-1} d_i^{\circ 2} = \Big(\frac{1}{k \cdot s^2}\Big) \sum C_i^\circ$$

[The remaining entries of this table, through equation (23), are illegible in the source.]

All sums and products are for i = 1 to k.


The expressions given in Table 5 differ slightly from those given in Table 4 since they are rescaled (dividing by $k \cdot s^2$) to make them unitless and independent of the number of keypoints, k. These measures have several important properties:

• Each numerator is the sum of the single keypoint diagnostics for the k pseudopoints.

• Each numerator is the weighted sum of the single-point predicted residuals for the pseudopoints. The weights are simple functions of the augmented leverages of the pseudopoints. These measure the distance of the pseudopoints from the centroid of the initial points. The distinctions between the diagnostic measures are in the functional form of the weights.

• The weights are of the form $\theta_i (1+\theta_i)^{-m}$, where the values of m generate a ladder of powers. It thus appears that the choice of crossproduct matrix, M, and the use of a differentiation vs a differencing approach, result in changing the power of the $(1+\theta_i)$ terms. This suggests that other values of m, not necessarily integers, may provide useful diagnostics. This needs to be studied further.

• The denominators of the measures define a scale based on only the initial data points.

It follows from the first two properties that a jointly influential set of points will typically have at least one equivalent pseudopoint that is individually influential.
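The decomposition can be illustrated for the differencing measure built on the full cross-product matrix $X^{*\prime}X^*$, whose quadratic form equals a weighted sum of squared pseudopoint predicted residuals with weights $\theta_i(1+\theta_i)^{-1}$. The sketch below uses simulated data and assumes T is the transposed eigenvector matrix of $G_z$; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 2

X_star = np.column_stack([np.ones(n + k), rng.normal(size=n + k)])
Y_star = X_star @ np.array([2.0, 1.0]) + rng.normal(size=n + k)
X_z, Y_z = X_star[:k], Y_star[:k]   # keypoints
X, Y = X_star[k:], Y_star[k:]       # reduced data

beta_star = np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Joint (unscaled) differencing measure with the full cross-product matrix.
diff = beta_star - beta
C_k = float(diff @ (X_star.T @ X_star) @ diff)

# Pseudopoint version: eigen-decompose G_z and rotate the predicted residuals.
G_z = X_z @ np.linalg.solve(X.T @ X, X_z.T)
theta, U = np.linalg.eigh(G_z)
d_k = Y_z - X_z @ beta
d_pseudo = U.T @ d_k                # predicted residuals of the pseudopoints

per_point = theta / (1.0 + theta) * d_pseudo ** 2
decomposition_holds = bool(np.isclose(C_k, per_point.sum()))
```

Each entry of `per_point` is the contribution of one pseudopoint, so a large joint value can be traced to the individual pseudopoints responsible for it.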


5. The Distribution of Joint Influence Measures.

Under the usual normality and independence assumptions of regression analysis, the vector of residuals, $e^*_{n+k}$, follows a normal distribution with mean zero, 0, and variance covariance matrix $(I-H)\sigma^2$. From equation (11) it follows that the vector of predicted residuals, $(d_n',\ d_k^{\circ\prime})'$, follows a normal distribution with mean zero, 0, and variance covariance matrix K, given by

    K = [matrix illegible in source]   (24)

The numerators of each of the measures (13) through (15) in Table 4 are of the form $d_k^{\circ\prime} M d_k^\circ$; therefore each of these numerators can be written as a weighted sum of k independent chi-squares with one degree of freedom. That is,

$$d_k^{\circ\prime} M d_k^\circ = \sum_{i=1}^{k} a_i\, \chi_i^2(1) \qquad (25)$$

where the $a_i$'s are the non-zero eigenvalues of $M(I+\Theta)$ and $\chi^2(r)$ denotes a chi-square variable with r degrees of freedom. For measure (13) $M = \Theta(I+\Theta)^{-1}$ and $a_i = \theta_i$; for measure (14) $M = \Theta(I+\Theta)^{-2}$ and $a_i = \theta_i(1+\theta_i)^{-1}$; for measure (15) $M = \Theta(I+\Theta)^{-3}$ and $a_i = \theta_i(1+\theta_i)^{-2}$. Similarly, the denominator of each of these measures, $d_n'd_n$, is a chi-square with n-p degrees of freedom.

The measures (13) through (15) are appealing in that they are "F-like"; however, since K, given in (24), is not block-diagonal, the numerator and the denominator are not independent.

Work by Gurland (1953) and others on the distribution of indefinite quadratic forms can be used to determine the cumulative distribution of measures of the form given in (13) through (15) in Table 4, as follows. We have

$$\Pr\left(\frac{d_k^{\circ\prime} M d_k^\circ}{d_n'd_n} > t\right) \qquad (26)$$

$$= \Pr\left(\frac{d_{n+k}' M_1 d_{n+k}}{d_{n+k}' M_2 d_{n+k}} > t\right) = \Pr\big(d_{n+k}'(M_1 - tM_2)\, d_{n+k} > 0\big) \qquad (27)$$

where

$$M_1 = \begin{bmatrix} 0 & 0 \\ 0 & M \end{bmatrix}, \qquad M_2 = \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix}.$$

The matrix $(M_1 - tM_2)$ is non-definite. The probability in (26) can be written, again following Baldessari (1967), as the

$$\Pr\Big(\sum_{j=1}^{s} a_j\, \chi_j^2(r_j) > 0\Big)$$

where the $a_j$'s are the s distinct eigenvalues of $K(M_1 - tM_2)$, the chi squares are mutually independent, and $r_j$ is the multiplicity of $a_j$.

Work along these lines, aimed at calculating the cumulative probability distributions of the measures in Table 4, is being pursued.
ABSTRACT

We present a very easy-to-program method of generating random samples from arbitrary discrete or continuous multivariate distributions. A Markov chain is constructed such that its equilibrium distribution is the desired distribution. The observations come from running the process until equilibrium is reached.

1.  INTRODUCTION 

Generating random samples from a given distribution is an essential part of any simulation study. There are many very general methods of generating random samples from univariate distributions. However, generalizing these techniques to multivariate distributions is difficult, and few algorithms have been developed. Hidden in applied fields and Monte Carlo literature is a very easy to program method of generating random samples from arbitrary multivariate distributions (see references 1, 2, 5, 6, 7 and 8). This method has been used to generate observations from certain spatial stochastic processes (Ripley 1979), although it has not been documented in its full generality with a careful proof in the statistical literature.

The idea behind this approach is to construct a discrete or continuous state space Markov chain such that its equilibrium distribution is the desired distribution. Each observation of the desired multivariate distribution is then an observation from the equilibrium distribution of this Markov chain.

2. THE BASIC METHOD

The method described below works for either a continuous or discrete distribution. For a continuous distribution the Markov chain is unusual in that the state space is continuous, but there are atoms of probability corresponding to no change. This requires a more complicated proof; the discrete case is simpler. The statements and proofs given below are for continuous distributions. Let $f(r)$, $r$ in $R^n$, be the probability density from which the observations are desired, and let $T(r|s)$ be the transition probability distribution function from $s$ to $r$ of a Markov chain. The standard condition for such a chain to have $f(r)$ as its equilibrium distribution is that

$$f(r) = \int \cdots \int_{s \in S} f(s)\, dT(r|s) \qquad (2.1)$$

where $r$ is fixed and integration is over $s$ in the state space S. For a continuous distribution, the distribution function for transitions from $s$ to $r$ has an atom of probability at $s = r$. In this case the distribution function for transitions from $s$ to $r$ satisfies

$$T(r|s) = p_s T_c(r|s) + (1 - p_s)\, I_s(r)$$

where $T_c$ is a continuous distribution function and $I_s(r) = 1$ if $r_i \ge s_i$ for all i and $I_s(r) = 0$ otherwise.

Lemma 2.1. A sufficient condition for $f(r)$ to be the equilibrium distribution of the Markov chain described above is that

$$t_c(r|s)\, f(s)\, p_s = t_c(s|r)\, f(r)\, p_r \qquad (2.2)$$

for all $r$ and $s$ in S, where

$$t_c(r|s) = \frac{\partial}{\partial r} T_c(r|s).$$

Proof. Consider

$$\begin{aligned}
\int \cdots \int_{s \in S} f(s)\, dT(r|s)
&= \int \cdots \int_{s \in S} f(s)\, t_c(r|s)\, p_s\, ds + (1 - p_r) f(r) \\
&= \int \cdots \int_{s \in S} f(r)\, t_c(s|r)\, p_r\, ds + (1 - p_r) f(r) \\
&= f(r).
\end{aligned}$$

We will now create a Markov chain satisfying (2.2) for which the equilibrium distribution is the desired distribution. Suppose for each $s$ in S, it is easy to generate an observation $r$ with probability density $g(r|s)$ which has a support containing S. We then generate an observation $r$ with density $g(r|s)$ and move from $s$ to $r$ with probability $a(r|s)$; otherwise we remain at $s$ with probability

$$1 - \int \cdots \int a(r|s)\, g(r|s)\, dr = 1 - p_s.$$

This is similar to a rejection technique; however, if the generated observation is not accepted, then the chain remains at its current location.

Theorem 2.1. For the processes described above, let

$$a(r|s) = \min\{q(r|s),\ 1\}$$

where

$$q(r|s) = g(s|r)\, f(r) / [g(r|s)\, f(s)].$$

Then the resulting transition distribution for the Markov chain satisfies (2.2) and the equilibrium distribution is $f(r)$.

Proof. Observe that $q(r|s) = 1/q(s|r)$. Thus if $q(r|s) \ge 1$, then $q(s|r) \le 1$. Consider, without loss of generality, the case where $q(r|s) \ge 1$. Then $a(r|s) = 1$ and $a(s|r) = q(s|r)$. Also

$$t_c(r|s) = g(r|s) \cdot 1/p_s$$

and

$$t_c(s|r) = g(s|r) \cdot q(s|r)/p_r.$$

Thus

$$\begin{aligned}
p_r\, t_c(s|r)\, f(r) &= g(s|r)\, q(s|r)\, f(r) \\
&= g(s|r)\, \frac{g(r|s)\, f(s)}{g(s|r)\, f(r)}\, f(r) \\
&= g(r|s)\, f(s) \\
&= p_s\, t_c(r|s)\, f(s).
\end{aligned}$$

The conclusion follows by Lemma 2.1.
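For a small discrete state space, the conclusion of Theorem 2.1 can be checked exactly. The sketch below is illustrative: the three-state target `f` and the proposal matrix `g` (with `g[s][r]` playing the role of $g(r|s)$) are hypothetical values, and exact rational arithmetic is used so that equality can be tested without rounding.

```python
from fractions import Fraction as F

# Hypothetical target distribution on three states.
f = [F(1, 6), F(2, 6), F(3, 6)]

# Hypothetical proposal: g[s][r] is the probability of proposing r from s.
g = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 3), F(1, 3), F(1, 3)],
    [F(1, 5), F(2, 5), F(2, 5)],
]


def transition_matrix(f, g):
    """Build P[s][r]: propose r from s, accept with probability min(q, 1)."""
    m = len(f)
    P = [[F(0)] * m for _ in range(m)]
    for s in range(m):
        for r in range(m):
            if r != s:
                q = (g[r][s] * f[r]) / (g[s][r] * f[s])
                P[s][r] = g[s][r] * min(q, F(1))
        # Remaining probability is the atom for "no change".
        P[s][s] = 1 - sum(P[s])
    return P


P = transition_matrix(f, g)
stationary = [sum(f[s] * P[s][r] for s in range(len(f))) for r in range(len(f))]
balanced = all(
    f[s] * P[s][r] == f[r] * P[r][s] for s in range(len(f)) for r in range(len(f))
)
```

Here `stationary` is the distribution after one step started from `f`, and `balanced` tests the discrete analogue of condition (2.2).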

3.  THE  ALGORITHM 

The desired observation has density f(s). It is easy to generate observations with a density g(r|s) for each value of s. Equilibrium is assumed to be reached after NEQ steps of the Markov chain.

STEP0. Set COUNTER equal to 0.

STEP1. Generate an observation s from some initial distribution.

STEP2. Generate an observation r with density g(r|s).

STEP3. Increment the COUNTER.

STEP4. Calculate q(r|s) = g(s|r)f(r)/[g(r|s)f(s)].

STEP5. If q(r|s) ≥ 1 then set s = r and go to step 8.

STEP6. Generate U from the uniform distribution on (0,1).

STEP7. If U < q(r|s) set s = r.

STEP8. If COUNTER < NEQ then go to step 2.

STEP9. Deliver s.

The resulting observation s has density f(s) (approximately). To generate more observations the COUNTER is set equal to 0, but the algorithm is started again at Step 2.
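The steps above can be sketched as follows. This is a minimal illustration rather than the authors' program: the function names, the seed, and the use of Python's `random` module are assumptions, and the target is the bivariate binomial distribution with p = 1/3, whose seven support points and probabilities were worked out by hand for this sketch. The proposal g is uniform on the support.

```python
import random

# Support and probabilities of (X, Y) = (X1 + X2, X1 + X3), Bernoulli p = 1/3.
PROB = {
    (0, 0): 8 / 27, (0, 1): 4 / 27, (1, 0): 4 / 27, (1, 1): 6 / 27,
    (1, 2): 2 / 27, (2, 1): 2 / 27, (2, 2): 1 / 27,
}
SUPPORT = list(PROB)


def run_chain(f, g_density, g_draw, s, neq, rng):
    """STEP2 to STEP8: advance the chain NEQ steps from state s."""
    for _ in range(neq):
        r = g_draw(s, rng)                                        # STEP2
        q = (g_density(s, r) * f(r)) / (g_density(r, s) * f(s))   # STEP4
        if q >= 1 or rng.random() < q:                            # STEP5 to STEP7
            s = r
    return s                                                      # STEP9


def draw_sample(n_obs, neq=10, seed=0):
    """Generate n_obs observations, restarting at STEP2 after each delivery."""
    rng = random.Random(seed)
    f = PROB.__getitem__
    g_density = lambda r, s: 1.0 / len(SUPPORT)   # g(r|s): uniform proposal
    g_draw = lambda s, rng: rng.choice(SUPPORT)
    s = rng.choice(SUPPORT)                       # STEP1: initial distribution
    sample = []
    for _ in range(n_obs):
        s = run_chain(f, g_density, g_draw, s, neq, rng)
        sample.append(s)
    return sample


sample = draw_sample(3000)
freq_00 = sample.count((0, 0)) / len(sample)
```

With NEQ = 10, the relative frequency `freq_00` should be close to 8/27, the target probability of the point (0, 0).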

4. PERFORMANCE OF THE ALGORITHM

In practice, the process is run a finite number of times to yield an observation which has approximately the equilibrium distribution. However, the rate of convergence of Markov chains is known to be geometric and therefore only a few steps in the Markov chain will be necessary. We have tried running the process 5 to 15 times for the bivariate Dirichlet, the bivariate binomial, the bivariate log-series and the bivariate normal distributions, and have found that ten steps are enough to reach equilibrium. The location (centering) of g(r|s) is much more important than the shape. We have used a distribution $g(r|s) = \prod g(r_i)$ of the product of independent random variables (not depending on s).

We used the algorithm to generate observations from the bivariate binomial, the bivariate log-series and the bivariate Dirichlet distributions. In each case we generated 10 sets of 1000 observations, and used the chi-square goodness-of-fit test. In every case the number of steps to reach equilibrium was taken to be 10. There were no samples rejected at a 5% level.

Additionally, 10,000 observations were generated for each of these distributions, and the chi-square goodness-of-fit test was done.

The results given below show the algorithm works quite well.

4.1 The Bivariate Binomial Distribution. Let X_1, X_2 and X_3 be
independent identically distributed Bernoulli random variables
with p = 1/3, and let X = X_1 + X_2 and Y = X_1 + X_3. Then (X,Y)
has a bivariate binomial distribution. We used the uniform
distribution on the 7 possible values for the initial distribution
as well as for g(r|s). The observed and (expected) frequencies for
10,000 observations are given in the following table:


x\y        0              1              2

0          2973 (2962)    1513 (1482)    —
1          1497 (1482)    2232 (2222)    680 (741)
2          —              734 (741)      371 (370)

The  chi-square  value  is  5.98. 
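
The expected frequencies and the chi-square value can be checked by enumerating the eight outcomes of (X_1, X_2, X_3). The sketch below is an illustration with assumed variable names; it uses unrounded expected frequencies, so its result agrees with the reported value only up to rounding.

```python
from itertools import product

p = 1 / 3
n_obs = 10000

# Exact joint distribution of (X, Y) = (X1 + X2, X1 + X3) by enumeration.
prob = {}
for x1, x2, x3 in product((0, 1), repeat=3):
    weight = 1.0
    for bit in (x1, x2, x3):
        weight *= p if bit else 1 - p
    cell = (x1 + x2, x1 + x3)
    prob[cell] = prob.get(cell, 0.0) + weight

# Observed frequencies from the table above, keyed by (x, y).
observed = {(0, 0): 2973, (0, 1): 1513, (1, 0): 1497, (1, 1): 2232,
            (1, 2): 680, (2, 1): 734, (2, 2): 371}

# Pearson chi-square statistic over the 7 cells with positive probability.
chi_square = sum(
    (observed[cell] - n_obs * prob[cell]) ** 2 / (n_obs * prob[cell])
    for cell in prob
)
```

Only seven cells receive positive probability, which is why the cells (0,2) and (2,0) are empty in the table.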

4.2 The Bivariate Log-Series Distribution.

This distribution is a discrete distribution with unbounded
support. Its density is

P(X_1 = x_1, X_2 = x_2)
    = θ_1^{x_1} θ_2^{x_2} (x_1 + x_2 − 1)!
      / [−ln(1 − θ_1 − θ_2) x_1! x_2!],      x_1 + x_2 ≥ 1,

where x_i = 0, 1, ...; 0 < θ_i < 1 and θ_1 + θ_2 < 1.

We generated observations from the bivariate log-series
distribution with θ_1 = θ_2 = 0.2, using the product of
independent Pascal distributions with parameters k = 1 and
p = 1 − 1/e for the initial distribution as well as for g(r|s).
The observed and (expected) frequencies for 10,000 observations
are given in the table below.



x_1\x_2    0               1               2              3            4

0          —               3994 (3915.2)   368 (391.5)    53 (52.2)    10 (7.83)
1          3894 (3915.2)   791 (783.1)     174 (156.6)    27 (31.3)    1 (6.3)
2          373 (391.5)     141 (156.6)     47 (46.9)      8 (12.5)
3          43 (52.2)       29 (31.3)       13 (12.5)

Remaining entries as printed (cell assignment not recoverable):
observed 9, 13, 5; expected (13.4), (7.8), (6.3).

The  chi-square  value  is  22.49. 
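
The expected frequencies can be reproduced directly from the density. The sketch below is illustrative; the function name is hypothetical and the parameter values θ_1 = θ_2 = 0.2 are an assumption, chosen because they reproduce the printed expected frequencies (for example 3915.2 and 783.1).

```python
import math


def log_series_pmf(x1, x2, theta1, theta2):
    """Bivariate log-series probability P(X1 = x1, X2 = x2)."""
    if x1 + x2 < 1:
        return 0.0
    norm = -math.log(1.0 - theta1 - theta2)
    return (theta1 ** x1 * theta2 ** x2 * math.factorial(x1 + x2 - 1)
            / (norm * math.factorial(x1) * math.factorial(x2)))


theta = 0.2  # assumed value, inferred from the printed expected frequencies
expected = {
    (a, b): 10000 * log_series_pmf(a, b, theta, theta)
    for a in range(5) for b in range(5)
}

# The probabilities over a large grid should sum to (almost exactly) one.
total = sum(log_series_pmf(a, b, theta, theta)
            for a in range(40) for b in range(40))
```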

4.3 The Bivariate Dirichlet Distribution. This distribution is
continuous with bounded support. Its density function is

f(y_1, y_2) = C y_1^{θ_1 − 1} y_2^{θ_2 − 1} (1 − y_1 − y_2)^{θ_3 − 1},

where 0 < y_i < 1, y_1 + y_2 < 1, and θ_i > 0.

We generated observations from the bivariate Dirichlet
distribution with θ_1 = θ_2 = θ_3 = 2.0, using the uniform
distribution on the triangle bounded by X = 0, Y = 0 and
X + Y = 1 for the initial distribution as well as for g(r|s).
Five equiprobable strips with sides parallel to X + Y = 1 were
divided into 10 equiprobable regions by the line X = Y. The
observed and expected frequencies for these 10 regions are given
below.


Region                1      2      3      4      5

observed frequency    1059   1012   1012   991    1003
expected frequency    1000   1000   1000   1000   1000

Region                6      7      8      9      10

observed frequency    937    976    999    978    1033
expected frequency    1000   1000   1000   1000   1000

The  chi-square  value  is  9.98. 
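
A minimal sketch of this experiment, assuming the setup described above (uniform proposal on the triangle, 10 steps per delivered observation). The function names are hypothetical. Because g(r|s) is constant on the triangle it cancels in q(r|s), and the normalizing constant C is never needed.

```python
import random


def dirichlet_kernel(y1, y2, t1=2.0, t2=2.0, t3=2.0):
    """Bivariate Dirichlet density up to the constant C."""
    y3 = 1.0 - y1 - y2
    if y1 <= 0.0 or y2 <= 0.0 or y3 <= 0.0:
        return 0.0
    return y1 ** (t1 - 1) * y2 ** (t2 - 1) * y3 ** (t3 - 1)


def uniform_on_triangle(rng):
    """Uniform draw on x > 0, y > 0, x + y < 1 by folding the unit square."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        u, v = 1.0 - u, 1.0 - v
    return u, v


def sample_dirichlet(n_obs, neq=10, seed=0):
    rng = random.Random(seed)
    s = uniform_on_triangle(rng)
    while dirichlet_kernel(*s) == 0.0:      # start where the target is positive
        s = uniform_on_triangle(rng)
    out = []
    for _ in range(n_obs):
        for _ in range(neq):
            r = uniform_on_triangle(rng)
            q = dirichlet_kernel(*r) / dirichlet_kernel(*s)  # g cancels
            if q >= 1.0 or rng.random() < q:
                s = r
        out.append(s)
    return out


draws = sample_dirichlet(3000, seed=2)
mean_y1 = sum(y1 for y1, _ in draws) / len(draws)
```

With all three parameters equal to 2, each coordinate has mean 1/3, so `mean_y1` should be close to 0.333.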

6. TIMINGS OF THE ALGORITHM

In order to have some idea of the efficiency of the algorithm, we
timed the algorithm used to generate observations from the
bivariate normal distribution, and timed the generation of the
bivariate normal distribution generated as a linear combination
of independent normals, each of which was generated using the
ratio of uniforms technique (Kinderman and Monahan 1977). The
uniform generator is a version of the generalized feedback shift
register (GFSR) algorithm of Lewis and Payne (1973). The routines
were written in FORTRAN and compiled using a MICROSOFT FORTRAN
compiler on a ZENITH Z152 (Z100-PC) computer. These timings and
the timings to generate 1000 observations of the three
distributions discussed in the previous sections are given below.

TIMINGS TO GENERATE 1000 OBSERVATIONS (MIN:SEC)

Bivariate:          I          II         III

Normal (RU)         1:21.46    1:20.69    1:20.52
Normal (MC)         2:31.21    2:31.65    2:35.99
Binomial (MC)       0:47.35    0:47.24    0:49.10
Log-Series (MC)     1:58.20    1:59.07    2:08.63
Dirichlet (MC)      0:52.18    0:52.45    0:54.81

I:   1 call of 1000 observations
II:  100 calls of 10 observations
III: 1000 calls of 1 observation


7.  COMMENTS  AND  DISCUSSION 

(1) There were two problems that occurred during the use of this
algorithm. Both are basically computational problems.

(a) If the initial distribution g(r|s) is too far away from the
desired distribution, then a larger number of steps to reach
equilibrium is needed. It is desirable to have g(r|s) and the
desired distribution as close to each other as possible. The
shape of g(r|s) is not as important as the location.

(b) While calculating q(r|s), we need to take the ratio of two
quantities. In practice, if these quantities are too small or too
large, the problem of overflow and underflow occurs. We suggest
taking the natural logarithm of q(r|s) in this case. Thus we have
ln α(r|s) = min{0, ln q(r|s)} and then find α(r|s).

(2) To implement this algorithm we need to know the density only
up to a constant.
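
The logarithmic form suggested in (b) can be sketched as follows. This is an illustration under assumed names, not the authors' code: the function takes log-densities as inputs, and the comparison with the uniform variate is also done on the log scale so that neither q nor α is ever exponentiated.

```python
import math
import random


def accept_log(log_f_r, log_f_s, log_g_s_given_r, log_g_r_given_s, rng):
    """Decide whether to move from s to r using ln q(r|s) only."""
    log_q = (log_g_s_given_r + log_f_r) - (log_g_r_given_s + log_f_s)
    log_alpha = min(0.0, log_q)          # ln of the acceptance probability
    if log_alpha == 0.0:
        return True                      # q >= 1: always accept
    u = 1.0 - rng.random()               # uniform on (0, 1], so log(u) is defined
    return math.log(u) < log_alpha


rng = random.Random(0)
# Densities of order exp(-2000) underflow to zero, but their logs are harmless.
underflows = math.exp(-2000.0) == 0.0
move_up = accept_log(-2000.0, -2100.0, 0.0, 0.0, rng)    # ln q = +100
move_down = accept_log(-2100.0, -2000.0, 0.0, 0.0, rng)  # ln q = -100
```

Because only differences of log-densities enter, an unknown normalizing constant of f cancels here as well, in line with comment (2).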
INFERENCE IN ROBUST DISCRIMINANT ANALYSIS


Vesna Lužar

University Computing Centre, Zagreb


In this paper a procedure for determining the asymptotic
distribution of the test statistic in a model for robust
discriminant analysis is proposed. The expressions, derived for
the general class of elliptical populations, were found to be
asymptotically distributed as χ². A Monte Carlo experiment has
been conducted in order to study the sampling distribution of the
proposed statistic.

KEY WORDS: robust discriminant analysis; asymptotic distribution
of covariances; elliptical distributions; Monte Carlo.

1. INTRODUCTION

A very simple model for Robust Discriminant Analysis (RDA) has
been proposed by Štalec and Momirović (1984). The method is based
on maximisation of variances of among-group means on mutually
orthogonal latent dimensions, defined in the space of
standardized group means of a set of not necessarily normally
distributed variables. Furthermore, for applying the method, the
condition of regularity of initial variables need not be
fulfilled.

Discriminant functions produced by the proposed algorithm are
correlated, but for interpretational purposes can be transformed
into orthogonal variables, using some factor analysis technique.

In this paper, a procedure for determining the number of
significant discriminant functions, defined by RDA, is suggested.
The expressions for test statistics are obtained for the general
class of elliptical populations and for normally distributed
variables, as a special case.

Finally, a Monte Carlo experiment was conducted in order to study
the sampling distributions of the proposed statistics for normal
and contaminated normal populations and for 3 different sample
sizes.

2. MODEL OF ROBUST DISCRIMINANT ANALYSIS

Let

Z = (z_ij),   i = 1, ..., n;  j = 1, ..., m

be the data matrix in standard normal form, obtained by the
description of the set of subjects E = {e_i, i = 1, ..., n} on
the set of quantitative variables V = {v_j, j = 1, ..., m}.

Let

S = (s_ik),   i = 1, ..., n;  k = 1, ..., g

be the selector matrix, obtained by the description of the set E
on the nominal variable N = {n_k, k = 1, ..., g}, where g denotes
the number of groups.

By the operation

Q = S (S^T S)^{-1} S^T Z = PZ,

where P is a projector matrix, a matrix of standardized group
means of variables from V, centered to common zero mean, is
defined.

The matrix of covariances of variables in Q is:

G = Q^T Q / n = Z^T P Z / n,

where G is an m × m matrix of rank q = min(g − 1, m).

In order to obtain robust discriminant variables, covariances
between linear composites

L_i = Q y_i   and   K_i = Z y_i,   i = 1, ..., q          (1)

have to be successively maximized:

ω_i = L_i^T K_i / n = y_i^T G y_i = max,

under the constraints of orthonormality of transformation
vectors:

y_i^T y_j = δ_ij,   i, j = 1, ..., q                      (2)

It is easy to see that this formulation leads to the
characteristic equation:

G Y = Y Ω                                                 (3)

where

Y = (y_1 y_2 ... y_q)

and

Ω = diag(ω_1, ω_2, ..., ω_q)

are matrices of eigenvectors and nonzero eigenvalues,
respectively.

Robust discriminant variables are given by:

K = Z Y,

with the covariance matrix:

U = K^T K / n = Y^T R Y,

where R = Z^T Z / n is the correlation matrix of the variables;
U is not diagonal in general, so the discriminant variables are
not orthogonal.


3. TESTING HYPOTHESES IN ROBUST DISCRIMINANT ANALYSIS

In order to test a series of hypotheses

H_{q−k}: ω_{k+1} = ω_{k+2} = ... = ω_q = 0,   k = 1, ..., q − 1

where ω_i, i = 1, ..., q are the first q roots of the population
matrix G, a modification of a procedure proposed by Steiger &
Browne (1984) for testing correlations between optimal linear
composites was applied. The procedure gives, in fact, the
conditions under which optimal correlations can be tested in
exactly the same manner as simple correlations by replacing
original observations by optimal composite scores. Using the
following well-known proposition, the procedure can also be
applied in the case of covariances and for nonnormal populations:

Proposition 1 (Cook, 1951)

If T = n^{1/2} (S − I), where S is the sample covariance matrix
constructed from a sample of N = n + 1 i.i.d. m-vectors
x_1, ..., x_N with finite fourth moments, then the asymptotic
distribution of T is normal with mean zero and covariances
expressed in terms of cumulants of the distribution of x_i as

ψ_{ij,kh} = κ_{ijkh} + δ_ik δ_jh + δ_ih δ_jk              (4)


In the case of elliptical populations, the expression for
ψ_{ij,kh} can be simplified using the standardized kurtosis
parameter 3κ of the marginal distributions of x_i:

ψ_{ij,kh} = δ_ik δ_jh + δ_ih δ_jk
            + κ (δ_ij δ_kh + δ_ik δ_jh + δ_ih δ_jk)       (5)

For normal populations κ = 0 and covariances of the elements of
T are given using only the first two terms:

ψ_{ij,kh} = δ_ik δ_jh + δ_ih δ_jk


The requirements of the proposition given by Steiger & Browne
(1984) are satisfied if and only if the population vector y_i
(i = 1, ..., q) is differentiable, i.e. when the population root
ω_i (i = 1, ..., q) is distinct. It must be noted that the
proposition is not applicable if all the roots of G are zero.
Since the distribution of the elements of the matrix G is known
from Proposition 1, the asymptotic distribution of covariances
between L = ZY and K = PZY can be found from the sample
covariance matrix

C = | L^T L/n   L^T K/n |  =  | Ω   Ω |
    | K^T L/n   K^T K/n |     | Ω   U |

From matrix C and Proposition 1, with the use of some algebra,
the following result can be obtained:

PROPOSITION 2

If ω̂ = vec(ω̂_1, ω̂_2, ..., ω̂_q) is the vector of covariances
between

L_i = Z y_i   and   K_i = P Z y_i,   i = 1, ..., q,

where y_i are eigenvectors of G, and Z is a data matrix from an
elliptical population, then the asymptotic distribution of
n^{1/2} (ω̂ − ω), where ω is the q × 1 vector of population
covariances, is normal with mean zero and covariances given by
the q × q matrix ψ:

ψ = ψ_1 + κ ψ_2                                           (6)

ψ_1 = Ω * U + Ω * Ω      (Ω = diag(ω_1, ω_2, ..., ω_q))   (7)

ψ_2 = ψ_1 + ω ω^T                                         (8)

where * denotes Hadamard's product, and 3κ is the standardized
kurtosis of the marginal distributions of vectors from Z.

Finally, in order to test hypothesis H_{q−k}, a matrix

M_{q−k} = [0  I]

is formed, where 0 is the (q − k) × k null matrix and I is the
(q − k) × (q − k) identity matrix. Using M_{q−k}, hypothesis
H_{q−k} can be expressed in the equivalent form:

M_{q−k} ω = 0.


This, together with Proposition 2, leads to the final result:


PROPOSITION 3

If ω̂ and ω are q × 1 vectors of sample and population
covariances between L_i and K_i as given by (6), then the
asymptotic distribution of the statistic

T_{q−k} = n (M_{q−k} ω̂)^T (M_{q−k} ψ̂ M_{q−k}^T)^{-1} (M_{q−k} ω̂)   (9)

where ψ̂ is a consistent estimate of ψ, is χ² with q − k degrees
of freedom. In the case of normal populations, (9) can be
expressed as:

T^N_{q−k} = n (M_{q−k} ω̂)^T (M_{q−k} ψ̂_1 M_{q−k}^T)^{-1} (M_{q−k} ω̂)

where ψ̂_1 is a consistent estimate of ψ_1 as given by (7).
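
A statistic of this quadratic-form type can be evaluated as in the sketch below. It assumes that an estimate of ψ is already available as a q × q list of lists and that M_{q−k} = [0 I], so that M_{q−k} ω̂ is simply the last q − k covariances and M_{q−k} ψ̂ M_{q−k}^T is the lower-right block of ψ̂. Function names and the numbers in the demonstration are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def wald_statistic(omega_hat, psi_hat, n, k):
    """n (M w)^T (M psi M^T)^{-1} (M w) with M = [0 I] of size (q-k) x q."""
    selected = omega_hat[k:]                   # M w: last q-k covariances
    block = [row[k:] for row in psi_hat[k:]]   # M psi M^T: lower-right block
    z = solve(block, selected)                 # (M psi M^T)^{-1} (M w)
    return n * sum(w * zi for w, zi in zip(selected, z))


# Hypothetical numbers, for illustration only.
omega_hat = [0.5, 0.2, 0.1]
psi_hat = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.25]]
t_1 = wald_statistic(omega_hat, psi_hat, n=80, k=2)   # tests the last root
t_2 = wald_statistic(omega_hat, psi_hat, n=80, k=1)   # tests the last two roots
```

The resulting value would be referred to the χ² distribution with q − k degrees of freedom.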


4. MONTE CARLO EXPERIMENT

In order to study the sampling distribution of the proposed test
statistic, a Monte Carlo experiment was conducted for two types
of elliptical distributions: normal and contaminated normal
(ε = .1, σ = 3). Number of variables and number of groups were
fixed to five and four, respectively. All four groups were of the
same size. Three different sample sizes per group were used:
n_g = 20, 50, 100.

For each case, 200 replications were generated.

Model matrix of group means was:


The experiment was conducted through the use of a package for
generating random matrices developed by Lužar (1981).
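
The kurtosis parameter of the contaminated normal population follows from the moments of the mixture. The sketch below assumes the usual scale-contaminated model (1 − ε) N(0, 1) + ε N(0, σ²), which the text does not spell out, and uses the convention that the excess kurtosis equals 3κ. The function name is hypothetical.

```python
def contaminated_normal_kappa(eps, sigma):
    """Kurtosis parameter kappa of (1 - eps) N(0, 1) + eps N(0, sigma^2).

    Second moment:  1 + eps (sigma^2 - 1)
    Fourth moment:  3 [1 + eps (sigma^4 - 1)]
    Excess kurtosis = fourth / second^2 - 3 = 3 kappa.
    """
    second = 1.0 + eps * (sigma ** 2 - 1.0)
    fourth_over_3 = 1.0 + eps * (sigma ** 4 - 1.0)
    return fourth_over_3 / second ** 2 - 1.0


kappa = contaminated_normal_kappa(0.1, 3.0)
```

For ε = .1 and σ = 3 this gives κ of about 1.78, the value quoted in Table 1.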

Table 1 shows empirical type I error rates according to T_1 and
T_1^N, as compared to nominal rates α = .050, .025 and .01, for
all 6 analysed cases. In general, empirical rates for α = .05 and
α = .025 for normal populations are higher than the nominal ones.
On the other hand, the sampling distribution of the T_1
statistic, adjusted for nonzero kurtosis, is shorter tailed than
the χ²(1) distribution. Deviations from the theoretical
distribution could be due to truncation from below of the normal
distribution of the smallest nonzero root of the positive
semidefinite matrix G. Unfortunately, the distribution of
quadratic forms in truncated normal variables is not easy to
obtain. In any case, in using the T_{q−k} (T^N_{q−k}) statistic
in determining the number of significant robust discriminant
variables, one should bear in mind that the sampling distribution
in question is shorter tailed than the χ² distribution.

Also, it should be pointed out that additional experiments, with
different model matrices, should be conducted in order to come to
general conclusions concerning the sampling distribution of the
T_{q−k} (T^N_{q−k}) statistic.

5. CONCLUSIONS

The intent in this paper has been to suggest a method for
determining the number of significant discriminant variables in
RDA. The proposed statistic, obtained for the general class of
elliptical populations, is asymptotically distributed as χ².

A Monte Carlo experiment led to some conclusions concerning the
sampling distribution of the proposed statistic.
TABLE 1. Number of times, out of 200, T_1 (T_1^N) exceeded the
α-th percentile of χ²(1) (empirical type I error rate)

n_g    Population (statistic)                   α = .050     α = .025     α = .010

20     Normal (T_1^N)                           17 (.085)    7 (.025)     1 (.005)
       Contaminated normal, κ = 1.78 (T_1)      7 (.035)     2 (.010)     0 (.000)

50     Normal (T_1^N)                           10 (.050)    3 (.015)     0 (.000)
       Contaminated normal, κ = 1.78 (T_1)      7 (.035)     1 (.005)     0 (.000)

100    Normal (T_1^N)                           12 (.060)    7 (.035)     1 (.005)
       Contaminated normal, κ = 1.78 (T_1)      5 (.025)     1 (.005)     0 (.000)
EXACT CONFIDENCE BOUNDS FOR PROPORTIONS


Donald  L.  Marx,  University  of  Alaska,  Anchorage 


INTRODUCTION 

The construction of confidence interval estimates for proportions
based on a random sample from an infinite population is one of
the earliest inferential techniques learned by statistics users.
The normal approximation is so well known that its accuracy is
seldom questioned for samples larger than size 30. Charts and
tables are available for constructing exact confidence bounds for
smaller samples. A BASIC computer program for constructing exact
confidence bounds is presented in this paper.

There  are  several  advantages  of  computerizing 
the  exact  confidence  bounds  procedures.  For 
example,  any  sample  sizes  and  confidence  levels 
can  be  considered;  one  is  not  restricted  to  only 
those  values  appearing  on  a  chart  or  in  a  table. 
Inaccuracies  inherent  in  reading  graphical  charts 
are  avoided.  There  is  less  need  to  rely  upon  the 
approximate  procedures  based  upon  the  standard 
normal  distribution. 

Following a brief discussion of accepted methods for constructing
confidence bounds for proportions, this paper describes a
computer algorithm for implementing the exact procedure.
Inaccuracies of normal approximations are illustrated by several
examples presented in the final section. In many cases the
approximate procedures realize only one decimal accuracy. Very
rarely is the accuracy greater than two decimal places for
confidence levels of 95% or more and samples of up to 100
observations.

DISCUSSION 

At a recent ASA Short Course on nonparametric statistics, Conover
and Iman (1981) suggested three methods for constructing
confidence intervals for proportions. Method A requires the use
of specially designed charts, method B uses exact binomial
distribution tables, and method C is based upon the normal
approximation of the binomial distribution. Iman and Conover
recommend that one use method A for 95% and 99% symmetric
confidence intervals, use method B for other confidence levels
and samples not greater than 20, and use method C for large
samples.

The charts required for method A are available in Pearson and
Hartley (1976). These charts were constructed for two confidence
levels (95% and 99%) and selected sample sizes from 8 to 1000.
Each confidence bound was constructed by calculating sufficient
points (θ, x) implicit in the cumulative binomial distribution
function F(x; n, θ) = constant to draw smooth curves for the
upper (UB) and lower (LB) bounds when x 'successes' are observed
in a sample of size n. As stated by the authors (p. 84) "The
charts cannot and are not intended to provide very precise
readings." Although graphical interpolation can be used to
approximate bounds for sample sizes greater than eight that are
not explicitly included in the charts, there is no way to use the
charts for sample sizes less than eight.

Binomial probability mass function tables f(x; n, θ) are readily
available in elementary and intermediate textbooks and reference
books. Method B requires that the cumulative binomial
distribution function, F(x; n, θ), for the observed number of
'successes' x in a sample of size n be provided for incremental
values of θ between 0 and 1. Typically interpolation on θ is
required to obtain UB and LB. For accurate determination of the
bounds, F(x; n, θ) must be available in very small increments of
θ. Adequate tables for method B are seldom found in books, but
simple computer or programmable calculator routines can be easily
developed for this purpose.

The construction of confidence intervals for proportions based
upon the standard normal distribution is well known and widely
used. This approximate procedure is based on the central limit
theorem and estimating the standard deviation of the sample
proportion in the implementation of the normal approximation.
Many elementary textbook authors suggest this approximation is
adequate for sample sizes greater than thirty. Fleiss (1981)
documents two refinements of the normal approximation method. One
refinement, which he recommends for sample proportions between
0.3 and 0.7, is the familiar continuity correction factor. The
other, which Fleiss suggests for extreme sample proportions,
includes the continuity correction factor and using the standard
deviation of the sample proportion rather than its estimate in
the normal approximation. The effect of the latter is that
quadratic equations must be solved for the confidence bounds.

More central to the development of this paper is the distinction
between Iman and Conover's methods B and C (and Fleiss'
refinements to method C). Method B is called the exact method
because it is based upon the true, exact sampling distribution
for the sample proportion. Except for the cumbersomeness of
implementing it, method B is preferred over any of the
approximations. Method A, incidentally, is nothing more than an
attempt to reduce the cumbersomeness of method B by using
graphical charts. The difficulties of accurately reading charts
notwithstanding, method A is properly described as an exact
method for constructing confidence intervals for proportions.

DESCRIPTION  OF  ALGORITHM 

The algorithm described here implements Conover and Iman's method
B. It was coded in MBASIC to generate the example results below
on the Osborne I computer. Suppose we wish to construct a
(1 − α)100% upper bound for θ based on an observed sample result,
say x 'successes' in n independent Bernoulli trials. Assume
0 < x < n. An initial estimate for the upper confidence bound
(UB) is constructed using Fleiss' quadratic function form of the
normal approximation. Appropriate terms of the probability mass
function, f(x; n, UB), and cumulative distribution function,
F(x; n, UB), are computed for the binomial model with parameters
n and UB. The exact confidence level for the upper bound UB is
[1 − F(x; n, UB)]100%. Based on the difference between the exact
and desired confidence levels, F(x; n, UB) − α, the value for the
upper bound is revised, and the above procedure is repeated. The
procedure continues to be repeated until either the difference
between the exact and desired confidence levels is acceptably
small or an established maximum number of iterations is executed.
Default values are 0.00001 for the accuracy specification and 10
iterations.

The (1 − α)100% lower bound is constructed in a similar fashion.
Fleiss' quadratic form of the normal approximation provides the
initial estimate LB. f(x − 1; n, LB) and F(x − 1; n, LB) are
computed. The exact confidence level for the lower bound LB is
F(x − 1; n, LB)100%. Based upon the difference
F(x − 1; n, LB) − (1 − α), the value for the lower bound is
revised, and the process is repeated as indicated in the previous
paragraph.

The binomial probability mass function f(x; n, θ), for given
sample data n and x, is constructed using the recursive
relationship

f(0; n, θ) = (1 − θ)^n

f(k; n, θ) = [(n − k + 1)θ / (k(1 − θ))] f(k − 1; n, θ),   k = 1, 2, ..., x.

The cumulative distribution function F(x; n, θ) is constructed
using

F(0; n, θ) = f(0; n, θ)

F(k; n, θ) = F(k − 1; n, θ) + f(k; n, θ),   k = 1, 2, ..., x.
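
The two recursions translate directly into a short loop. The sketch below is in Python rather than the MBASIC of the original program, and the function name is an assumption; it requires 0 ≤ θ < 1.

```python
import math


def binomial_pmf_cdf(x, n, theta):
    """Return (f(x; n, theta), F(x; n, theta)) by the recursion from k = 0."""
    f = (1.0 - theta) ** n          # f(0; n, theta)
    cdf = f                         # F(0; n, theta)
    for k in range(1, x + 1):
        f *= (n - k + 1) * theta / (k * (1.0 - theta))
        cdf += f
    return f, cdf


f6, F6 = binomial_pmf_cdf(6, 20, 0.3)
direct = math.comb(20, 6) * 0.3 ** 6 * 0.7 ** 14   # closed form, for comparison
```

The recursion avoids computing factorials, which matters on a machine with limited floating-point range, but the starting term (1 − θ)^n can still underflow for large n.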

Revision of the boundary estimate in the iterative procedure
described above is accomplished in either of two ways. The first
way uses the first two terms of the Taylor's series expansion for
F(x; n, θ) about the current value of the bound, say B. That is,
the revised value for the bound is

B − [(1 − B) / ((nB − x) f(x; n, B))] [α − F(x; n, B)]

The second way, which is implemented only if the first way
produces a revised value that is outside the interval (0,1), is
the simple average of B and the interval boundary that is
exceeded by the first way. That is, B/2 is used if the revised
value above is less than or equal to zero, and (1 + B)/2 is used
if the revised value above is greater than or equal to one.
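
The iteration with its fallback can be sketched as below. This is an illustration under stated assumptions, not the MBASIC program: the linear (two-term Taylor) step is taken with the derivative dF/dθ = −(n − x) f(x; n, θ)/(1 − θ), a standard identity for the binomial distribution function, which puts the factor (n − x) in the denominator and may differ from the expression as printed. Function names are hypothetical, and the starting values are the Fleiss approximations quoted in the example for 6 'successes' in 20 trials.

```python
def pmf_cdf(x, n, theta):
    """f(x; n, theta) and F(x; n, theta) by the recursion from k = 0."""
    f = (1.0 - theta) ** n
    cdf = f
    for k in range(1, x + 1):
        f *= (n - k + 1) * theta / (k * (1.0 - theta))
        cdf += f
    return f, cdf


def solve_bound(x, n, target, start, tol=1e-5, max_iter=10):
    """Find B with F(x; n, B) = target, for 0 <= x < n and 0 < start < 1."""
    b = start
    for _ in range(max_iter):
        f, cdf = pmf_cdf(x, n, b)
        if abs(cdf - target) < tol:
            break
        # Linear step: F(B') ~ F(B) + F'(B)(B' - B), F'(B) = -(n - x) f / (1 - B)
        revised = b - (1.0 - b) * (target - cdf) / ((n - x) * f)
        if revised <= 0.0:
            revised = b / 2.0            # fallback: halve toward 0
        elif revised >= 1.0:
            revised = (1.0 + b) / 2.0    # fallback: halve toward 1
        b = revised
    return b


alpha = 0.005                                     # each tail of a 99% interval
ub = solve_bound(6, 20, alpha, start=0.6064)      # F(6; 20, UB) = alpha
lb = solve_bound(5, 20, 1 - alpha, start=0.1013)  # F(5; 20, LB) = 1 - alpha
```

The lower bound reuses the same routine with x − 1 in place of x and 1 − α as the target, mirroring the description of the lower-bound construction.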

The normal approximations used to obtain estimates for confidence
boundaries require standard normal quantiles. Quantiles, accurate
to four decimal places, from the standard normal distribution for
selected probabilities are contained in the computer code. Twenty
values for tail probabilities from 0.0001 to 0.5 were selected.
Linear interpolation is implemented for intermediate probability
values.

Efficiency and accuracy in computing exact confidence levels for
current values of bounds are enhanced by taking advantage of the
complementarity of variables in the binomial model. That is, if
random variable X has the binomial distribution with parameters n
and θ, then n − X is binomial with parameters n and 1 − θ. The
computations of f(x; n, θ) and F(x; n, θ) are implemented so that
the iterative formulas above are taken from zero to x or n − x,
whichever is smaller. Additional efficiency is accomplished by
handling the cases x = 0 and x = n separately from the general
case 0 < x < n. Symmetric confidence intervals do not exist for
these special cases. When x = 0, the exact (1 − α)100% upper
bound is 1 − α^{1/n} and there is no (1 − α)100% lower bound.
When x = n, the exact (1 − α)100% lower bound is α^{1/n} and
there is no (1 − α)100% upper bound.
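
For the two special cases the defining equations can be solved in closed form: F(0; n, UB) = (1 − UB)^n = α and F(n − 1; n, LB) = 1 − LB^n = 1 − α. The sketch below, with hypothetical function names, evaluates these solutions and checks them by substitution.

```python
def upper_bound_no_successes(n, alpha):
    """Solve (1 - UB)^n = alpha for the case x = 0."""
    return 1.0 - alpha ** (1.0 / n)


def lower_bound_all_successes(n, alpha):
    """Solve LB^n = alpha for the case x = n."""
    return alpha ** (1.0 / n)


n, alpha = 20, 0.05
ub0 = upper_bound_no_successes(n, alpha)
lbn = lower_bound_all_successes(n, alpha)
check_ub = (1.0 - ub0) ** n     # should reproduce alpha
check_lb = lbn ** n             # should reproduce alpha
```

By the complementarity noted above, the two bounds are mirror images of each other: `ub0 + lbn` equals 1.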

In the MBASIC implementation on the Osborne I, the algorithm
produces exact confidence bounds, either symmetric (when they
exist) or one sided as specified by the user, for specified
confidence levels from 50% to 99.99% and samples of up to 126
observations. Accuracy is within the specified default value
(0.00001) for each bound; consequently the guaranteed accuracy of
symmetric confidence intervals is 0.00002. Larger sample sizes
can be accommodated, but the accuracy cannot be guaranteed due to
possible rounding problems. If the iterations limit is reached,
the algorithm displays the normal approximation bound and the
message: MAXIMUM ITERATIONS COMPLETED. If the sample is too large
to accurately implement the iterative formulas for f(x; n, θ),
the normal approximation is displayed with the message: SAMPLE
SIZE TOO LARGE OR SAMPLE PROPORTION TOO NEAR 1/2 TO COMPUTE EXACT
BINOMIAL DISTRIBUTION. For large samples and proportions near
1/2, the initial term f(0; n, θ) in the iterative formulas
underflows and is set to zero.

EXAMPLES 

A specific sample result is considered first to illustrate the
operation of the algorithm. Attention is then focused on the
inaccuracies of the approximate procedures. Inaccuracies in
confidence bounds estimators for the example introduced in the
following paragraph and several additional examples are
discussed.

Suppose we want to construct a symmetric 99% confidence interval
estimate for the proportion of 'successes' in an infinite
population based on a random sample of size 20. Let random
variable Y denote the number of 'successes' in random samples of
size 20 from the population. Then Y has the binomial distribution
with unknown parameter θ and n = 20. For y observed 'successes',
the upper bound, UB, is the value of θ such that
Prob(Y ≤ y) = 0.005; the lower bound, LB, is the value of θ such
that Prob(Y ≥ y) = 0.005. UB is given implicitly in terms of the
cumulative distribution function as F(y; 20, UB) = 0.005. LB is
implicit in F(y − 1; 20, LB) = 0.995.

Now suppose that six 'successes' are observed; i.e. y = 6. The
MBASIC algorithm implements an iterative procedure to solve for
UB. The search is initiated by approximating UB with 0.6064,
Fleiss' quadratic equation approximation. F(6; 20, θ) where
θ = 0.6064 is calculated and compared with 0.005. If the
difference is within the specified tolerance, the search
terminates. Otherwise, the value for θ is revised using the first
two terms of the Taylor's series expansion for F(6; 20, θ).
F(6; 20, θ) using the revised value of θ is calculated and again
compared with 0.005. The search continues until the tolerance
specification is satisfied. The result is UB = 0.6096. Fleiss'
approximation is correct to two decimal places. The tail
probability, F(6; 20, 0.6064), beyond Fleiss' UB is 0.0055 rather
than the desired value, 0.005. The search for LB is similar. LB
is given implicitly by F(5; 20, θ) = 0.995. Fleiss' method gives
LB = 0.1013; the solution given by the MBASIC algorithm is
LB = 0.0846. Fleiss' LB is accurate to only one decimal place.
The tail probability, 1 − F(5; 20, 0.1013) = 0.0119, is more than
double the desired value. Table I lists 99% symmetric confidence
limits for this example using the exact method as well as the
various normal approximations. Tail probabilities for the various
bounds estimators are also listed.
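
The approximate bounds quoted for this example can be reproduced as follows. This is a sketch under assumptions: the simple bounds are p ± z sqrt(p(1 − p)/n), and the quadratic bounds use the continuity-corrected form commonly attributed to Fleiss (1981), which is not written out in the text and is taken here as an assumption. The quantile 2.5758 is the standard normal value for a tail probability of 0.005; function names are hypothetical.

```python
import math

Z_TAIL_005 = 2.5758   # standard normal quantile for tail probability 0.005


def simple_bounds(x, n, z):
    """Method C: p +/- z * sqrt(p (1 - p) / n)."""
    p = x / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


def quadratic_bounds(x, n, z):
    """Continuity-corrected quadratic bounds (assumed form, after Fleiss 1981)."""
    p = x / n
    q = 1.0 - p
    z2 = z * z
    denom = 2.0 * (n + z2)
    lower = ((2 * n * p + z2 - 1)
             - z * math.sqrt(z2 - (2 + 1 / n) + 4 * p * (n * q + 1))) / denom
    upper = ((2 * n * p + z2 + 1)
             + z * math.sqrt(z2 + (2 - 1 / n) + 4 * p * (n * q - 1))) / denom
    return lower, upper


simple_lb, simple_ub = simple_bounds(6, 20, Z_TAIL_005)
quad_lb, quad_ub = quadratic_bounds(6, 20, Z_TAIL_005)
```

Under these assumptions the simple bounds agree with the Method C values 0.0361 and 0.5639 in Table I, and the quadratic bounds agree with the values 0.1013 and 0.6064 quoted above, against exact bounds of 0.0846 and 0.6096.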

Exact and both simple and quadratic approximations for 99%
confidence intervals were constructed for sample results from
x = 1 'success' to x = 10 'successes'. The simple approximation
produced negative lower bounds estimates for samples with five or
fewer 'successes'. That is, in terms of the sample proportion p,
lower bounds when p ≤ 0.25 are negative. Only one decimal place
accuracy is achieved for 0.30 ≤ p ≤ 0.45. The greatest accuracy
achieved is only two decimal places when p = 0.50. Similar
inaccuracies are realized using the quadratic approximation
except that negative lower bounds are averted. Only one decimal
accuracy is achieved for p ≤ 0.45, and the greatest achieved
accuracy is only two decimal places when p = 0.50.

Additional examples were considered using samples of sizes 30,
50, and 100 observations. Exact and both simple and quadratic
approximations for 95% and 99% confidence interval bounds were
calculated. Achieved accuracies for the simple approximation
method are summarized in Figures I and II. Using the simple
approximation for 99% confidence level bounds, only one decimal
place accuracy is achieved for samples of 100


(Figure IV), at least two decimal place accuracy is realized
everywhere except for p ≤ 0.67 and samples of size 30. Three
decimal accuracy is achieved for p ≥ 0.42 and samples of 100
observations.

CONCLUSIONS 

The  accuracy  of  confidence  interval  estimators 
for  proportions  based  upon  the  simple  normal 
approximation  are  limited  to  two  decimal  places 
for  samples  up  to  size  100  and  confidence  levels 
of  at  least  95Z.  Two  decimal  place  accuracy  Is 
the  greatest  achievable,  and  in  many  cases  only 
one  decimal  place  accuracy  is  realized.  The 
quadratic  approximation  provides  some  Improvement, 
but  the  accuracy  is  still  limited  to  two  decimal 
place  accuracy  except  for  sample  proportions  very 
near  one  half  and  sample  sizes  of  at  least  100 
observations. 

The  alternative  to  using  the  approximate  proce¬ 
dures  is  to  construct  exact  confidence  Intervals 
for  proportions.  This  paper  describes  an  imple¬ 
mentation  of  the  exact  procedure  in  an  interactive 
computer  algorithm.  Programmed  in  MBASIC  code  on 
the  Osborne  I  computer,  five  place  accuracy  is 
realized  for  samples  of  up  to  126  observations. 
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TABLE  I 

99Z  Confidence  Interval  for  8  Based 
on  6  'successes'  in  20  Trials 
(Tall  probability  beyond  bound) 


Method 

LB 

— 

UB 

Exact 

0.0846 

0.6096 

(0.0050) 

(0.0050) 

Iman  &  Conover's 

0.0361 

0.5639 

Method  C 

(0.0001) 

(0.0157) 

Method  C  with 

0.0611 

0.5889 

cont .  corr . 

(0.0010) 

(0.0086) 

Method  C  with 

0.1013 

0.6064 

quadratic  eqn. 

(0.0119) 

(0.0055) 

observations  and  $  <  0.27.  The  greatest  accuracy 
of  two  decimal  places  is  realized  when  p  0.30. 
Reducing  the  confidence  level  to  95Z  provides 
little  improvement  in  the  accuracy  of  the  simple 
approximation  bounds;  two  decimal  accuracy  is 
realized  for  jf>_0.27. 

The  quadratic  approximation  method  for  99Z  con¬ 
fidence  Interval  bounds  provides  two  decimal  place 
accuracy  for  samples  of  size  50  and  &  >_  0.16  and 
for  all  samples  of  size  100.  (See  Figure  III.) 
When  the  confidence  level  is  reduced  to  95Z 
TUNING COMPUTER SYSTEMS FOR MAXIMUM
PERFORMANCE: A STATISTICAL APPROACH.

William  A.  Nazaret 
William J. Klingler

AT&T  Bell  Laboratories 
Holmdel,  New  Jersey  07733 

In  this  paper  we  discuss  a  statistical  approach  to  the  problem  of  setting  the 
tunable  parameters  of  an  operating  system  to  minimize  response  time  for 
interactive  tasks.  This  approach  is  applicable  to  both  tuning  and  benchmarking 
of  computer  systems.  It  is  based  on  the  use  of  statistical  experimental  design 
techniques  and  represents  an  inexpensive,  systematic  alternative  to  traditional 
ways  of  improving  the  response  of  systems.  We  illustrate  the  method  by  means 
of  experiments  performed  on  VAX*  minicomputers  running  under  the  UNIX* 
System  V  Operating  System. 


1. INTRODUCTION.

The  work  described  in  this  paper  has  been 
motivated  by  two  different,  but  related  problems, 
that  arise  in  the  analysis  of  computer  performance. 
The  first  is  how  to  set  an  operating  system’s  tunable 
parameters  to  achieve  the  best  response  time  for 
interactive  tasks,  given  the  computer’s  load 
conditions.  The  second  is  how  to  map  the 
relationship  among  the  different  tunable  parameters 
of  the  system  and  their  impact  on  response  time. 

Although the second problem appears to be a
generalization of the first, in practice the two arise
in different contexts. The first problem is normally
confronted  by  system  administrators  in  their  attempt 
to  get  the  most  performance  out  of  the  system  on 
behalf  of  the  users.  In  contrast,  the  second  question 
is  tackled  mostly  by  system  designers  and 
performance  analysts  who  are  charged  with  modeling 
system  performance  under  a  variety  of  loads  before 
the  systems  are  actually  handed  to  the  customers. 
This  activity  is  sometimes  called  "benchmarking". 

The system administrator's goal is to optimize the
response for important tasks in his/her organization
under the particular load conditions imposed on the
system by the users. Therefore we could say that
this problem is "local" by nature. On the other
hand, the responsibility of the system designer and
performance analyst is to understand how the system
reacts to changes in the tunable parameters for each
of many different loads that are likely to be
encountered in the field. In this sense the problem
is rather "global".

A very important consequence of this distinction
is that the measurements used for benchmarking are
usually made under "simulated" loads that are
designed to exercise the system in a given manner
and over which the experimenter has total control.
Tuning, instead, is done using data generated by the
actual load of users on the system. This type of load
is not under the complete control of the
administrator or the experimenter. Despite the
above differences, tuning and benchmarking have
something in common: the necessity to experiment
with different settings for the parameters in search
of a configuration that yields the best results.

In this paper we present a systematic, cost
effective approach to conducting these experiments.
This approach makes use of statistical techniques to
design experiments which yield, in many cases,
information nearly equivalent to that obtained by
performing a complete exhaustive test. Our
approach, although not new to statisticians, is
becoming popular among systems managers and
performance analysts as an alternative to more
traditional methods of experimentation.

Throughout  the  paper  we  use  the  UNIX* 
operating  system  as  an  example  of  a  tunable 
operating  system.  However,  the  method  is  applicable 
to  any  operating  system  (or  system  in  general)  which 
allows  the  user  the  freedom  to  adjust  its  operating 
characteristics.  In  Section  Two  we  present  an 
overview  of  UNIX  tunable  parameters  and  their 
potential impact on the system's response. Section Three
introduces  the  experimental  problem  by  describing 
three  experiments  carried  out  on  VAX  780,  785  and 
8600  machines  respectively.  Section  Four  explains 
our  statistical  strategy  to  estimate  the  effect  of  the 
parameters  on  response  time  for  certain  tasks.  In 
Section  Five  we  analyze  the  results  of  the 
experiments  and  show  the  improvement  achieved 
after  adjustment  of  the  parameters  according  to  these 
results.  Finally,  in  Section  Six  a  critique  of  our 
approach  is  given  together  with  some  extensions. 
2. WHAT DOES TUNING A COMPUTER MEAN?


Tuning  a  computer  system  is,  in  principle,  not 
very  different  from  tuning  any  system  in  general.  It 
amounts  to  finding  the  setting  of  certain  parameters 
to  satisfy  performance  requirements.  Hence,  the 
performance  requirements  determine  which  settings 
are "optimal" for an application. For instance, a
typical passenger car will be tuned to optimize fuel
consumption  and  reduce  emissions.  In  contrast,  a 
racing  car  will  be  tuned  to  maximize  speed  at  the 
expense  of  fuel  economy. 

The  UNIX  system  allows  fine  tuning  by  giving 
the  administrator  the  freedom  to  set  the  values  of 
some  kernel  parameters  at  boot  time.  Additionally, 
one  can  exercise  control  over  options  that  are  not 
part  of  the  kernel.  Some  of  them  relate  to  the 
hardware  and  some  to  software.  Examples  of  these 
parameters  and  their  significance  for  the  run  time 
environment  are  : 

A. System Buffers: These are chunks of physical
memory, typically 1024 bytes in size, which
are used by the operating system to keep
recently used data in the hope that it might be
needed again shortly afterwards. Increasing the
number of these buffers improves the "hit ratio" on
this cache up to a point. An excessive number of
these buffers will hurt performance, for it takes
memory space away from the users.

B. Sticky Processes: There is a bit associated with
the permissions on an executable file that will
cause its text segment to be stored in
contiguous blocks on the swapping device.
Commands which are frequently invoked
(especially those with large images) ought to
have the sticky bit set so that every time they
are invoked their code can be brought into
memory as easily as possible by the system.
Systems in which this is not done usually suffer
from chronic disk I/O bottlenecks and the
resulting degradation in response time. The
number and kind of commands with the sticky
bit set is a tunable parameter.

C. Paging Daemon Parameters: In virtual
memory implementations of UNIX, memory
used by processes is assigned on a per-page
basis. A page is just a piece of the code, usually
512 or 1024 bytes long. The paging
daemon is a system process whose responsibility
is to free up memory by reclaiming space
occupied by pages which are no longer in use.
A process can also be stripped of its pages if its
total CPU time exceeds a given value. How
often the daemon runs, how many
simultaneous active pages a process can have
and the maximum CPU time quota before a
process is swapped out are tunable parameters.


D.  File  System  Organization  :  This  is  a  highly 
installation  dependent  parameter.  The  idea  is 
to  distribute  the  system  and  user  files  among 
the  available  disks  in  a  way  that  the  load  on 
each of them is approximately the same. When
one of the disks is overloaded, I/O waits
increase and response is degraded.

E.  CPU  Assist  Devices  :  Some  types  of  hardware 
allow  the  possibility  of  adding  coprocessors  or 
add-on  boxes  to  relieve  the  CPU  of  mundane 
chores.  Some  examples  are  terminal  I/O  assist 
devices  and  troff  coprocessors.  The  use  and 
number  of  such  devices  can  be  subjected  to 
tuning. 

F.  Main  Memory  :  By  allowing  minor  changes  in 
the  amount  of  physical  memory  on  the  system 
it  is  possible  to  detect  whether  increasing 
memory  size  will  help  to  enhance  the 
performance of the system. This is helpful to
know before committing any resources to
buying the additional boards.

3.  THREE  CASE  STUDIES. 

To  illustrate  the  methodology  we  will  describe 
three  experiments  carried  out  on  VAX*  780,  785 
and  8600  systems  respectively,  running  under  the 
UNIX  System  V  Operating  System  at  the  Quality 
Assurance Center of AT&T Bell Laboratories.

The first of the three was conducted with an almost
entirely tuning orientation, and it is somewhat
similar to the one reported in [1]. Our goals were to
improve response on a system whose performance
was becoming unbearable and to overcome some
questionable aspects of the experience in [1]. Among
these aspects we targeted:

• Duration of the experiments: The experiment
described in [1] lasted more than three months.
We believe that any approach to tuning which
takes this long to produce results has very limited
practical value. Therefore, one of our goals was
to find ways to obtain useful results in a reasonable
amount of time.

•  Stationarity  of  the  load  :  A  second  goal  was  to 
ensure  that  the  load  conditions  were  reasonably 
stationary.  We  wanted  to  avoid  a  situation  in 
which  it  was  not  possible  to  detect  whether 
improvements  in  performance  were  due  to  the 
tuning  or  rather  due  to  a  decreased  level  of  load 
on  the  system  (this  is  essentially  what  happened 
in  [1]). 

The tunable parameters considered in this
experiment were: file system organization, memory
size, system buffers, sticky processes, KMC's
(terminal I/O processors) and PDQ's (troff
coprocessors).
The second experiment on a VAX 785 system
included only 5 of the 6 parameters considered in the
previous one. Sticky processes were dropped from
consideration because we already had considerable
prior knowledge about how to handle them. The
experiment was designed to allow us to estimate, in
addition to the main effects of the factors, some of
the interactions among them. Therefore it required
a larger number of trials. Our intentions here
went beyond tuning the system. We also wanted to
assess the merits of fancier (more expensive) design
plans relative to simple plans like the one used in
Experiment One. The precise meaning of this will
be given in the next section when we discuss the
experimental strategy.

The  third  experiment  differed  from  the  previous 
ones  in  a  very  important  characteristic.  The  load 
used  was  not  a  "live"  load  but  rather  a  simulated  load 
built using a strategy to imitate the essential aspects
of  the  type  of  load  we  expected  this  system  to  be 
subjected  to.  The  factors  considered  were  the  same 
as  in  the  second  experiment  except  that  KMC's  were 
not  included  for  they  are  not  necessary  in  this  new 
VAX  8600  model.  By  using  a  simulated  load,  the 
experimental  conditions  were  completely  under  our 
control  and  thus  we  expected  to  get  more  statistically 
reliable  results.  However,  there  was  no  previous 
experience  in  using  our  method  in  this  context  and 
little  was  known  about  the  validity  of  the  conclusions 
under  actual  load  conditions.  What  we  learned  from 
this  is  discussed  in  Section  Six. 

4.  THE  EXPERIMENTAL  STRATEGY. 

Before  basic  planning  for  the  experiment  can  be 
done,  we  need  to  choose  the  levels  at  which  each  of 
the  factors  is  going  to  be  tried.  Since  the  set  of 
factors  considered  for  Experiment  One  (VAX-780) 
contains the ones for the other two systems, it will
suffice  to  describe  the  levels  chosen  in  that  case. 
Figure  1  shows  the  levels  for  each  one  of  the  six 
factors. 

In  Figure  1  the  amount  of  system  buffers  space 
allocated  depends  on  the  total  size  of  memory. 
Hence,  Low,  Medium  and  High  represent  a 
different  fraction  of  memory  for  each  of  the  three 
memory  sizes.  We  determined  the  amount  of 
memory  assigned  to  these  buffers  using  a  formula  of 
the  form 

Sysbuff = C_L + 0.2 · K_L

where C_L is 1.0, 1.2 and 1.4 Megabytes when L is
Low, Medium and High respectively. Similarly, K_L is
0, 1 or 2.
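
As an illustration, the buffer formula can be tabulated with a few lines of Python. This is only a sketch: the text does not say outright what the index K runs over, so the code assumes that K = 0, 1, 2 indexes the three memory sizes from smallest to largest, and the names used are hypothetical.

```python
# Base allocation C_L in megabytes for each buffer level (from the text).
BASE_MB = {"Low": 1.0, "Medium": 1.2, "High": 1.4}


def sysbuff_mb(level, k):
    """Sysbuff = C_L + 0.2 * K, in megabytes.

    k is 0, 1 or 2; this sketch ASSUMES k indexes the memory size
    (0 = smallest, 2 = largest), which the text does not state outright.
    """
    if k not in (0, 1, 2):
        raise ValueError("k must be 0, 1 or 2")
    return round(BASE_MB[level] + 0.2 * k, 1)


# One row per buffer level, one column per assumed memory-size index.
for level in BASE_MB:
    print(level, [sysbuff_mb(level, k) for k in (0, 1, 2)])
```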

The choice of levels for the factors is highly
installation dependent and must be made taking into
account both the characteristics of the load and prior
knowledge about how changes in these parameters
are supposed to affect the system's response ([4] is
an excellent reference for this). A common strategy
starts from the current settings and introduces some
variations around them. The size of this variation
ranges from modest to large. Minor variations defeat
the purpose of the experiment. One exception to this
strategy occurs when the current setting of one of the
parameters is clearly wrong (non-optimal). In Figure
1 we have such an example. The choice of settings for
sticky processes was to set the sticky bit on the 10,
20 and 30 most popular commands. The number of sticky
processes before the experiment was about 40 but they were
not chosen according to frequency of invocation. We
therefore ignored them.

Next we had to decide on the measures to assess
the performance of the system under the different
experimental conditions. A common choice is the
time the system needs to execute a script containing
tasks important to the organization. For instance,
in a text processing organization such a script would
consist of formatting a document. The four measures
we typically use are: trivial time, edit time, troff
time and c-compile time. Among these, the only one whose
name is not self-explanatory is "trivial" time. This is
just the response time for a command that involves
no interaction between the user and the system (e.g.
the "date" command). It gives a measure of
instantaneous response time.

Deciding on a sampling plan for the experiment is
a crucial and difficult task. As we noted above, we
wanted to reduce the total time for the experiment
as much as possible without compromising the
integrity of the methodology. After careful study of
the load on the VAX-780 and VAX-785 systems we
decided that it was safe to use a day as the basic
duration of a run. A day is a natural unit because it
allows you to set up the system from one run to the
next during off-hours, thereby sparing the users any
inconvenience. This choice is also minimal in the
sense that having the runs last less than a day would
force you to interfere with the normal functioning of
the system. More importantly, it would make
comparisons among runs invalid due to the within-day
variations of the load (peak hours). We must
note that a day may even be troublesome in some
installations where the behavior of the load depends,
say, on the day of the week.

In  the  third  of  the  experiments  a  whole  run  took 
only  about  an  hour  as  opposed  to  a  whole  day.  This, 
of  course,  was  due  to  the  use  of  a  simulated  load. 

Response  times  were  measured  at  evenly  spaced 
intervals  during  the  run.  It  is  important  not  to 
oversample  since  the  load  caused  by  the  timing 
programs  and  the  timed  scripts  could  interfere 
negatively  with  the  users. 

The most important aspect of the whole
experimental strategy is the choice of combinations
to be tried during the experiment. A complete
exhaustive search will most likely give us the right
answer. However, it is not hard to realize that the
time this would take is prohibitive. For instance,
from Figure 1 we gather that it would take about 729
days (3^6 combinations at one day per run) to run an
all-combinations experiment! Even
in the case of the third experiment (with simulated
loads) the administrative overhead is overwhelming:
it requires re-booting the system 81 times! Our
strategy is to run a fraction of all possible runs
following an array of combinations that allows us to
test all factors simultaneously in a fair way. Arrays
with this property are documented extensively in the
statistical literature (see [2] and [3]). For instance,
the design used in Experiment 1 (VAX 780) is given
by Figure 2. It was constructed using an orthogonal
array known as the L18, consisting of only 18 runs.

For the second experiment we used another
orthogonal array known as the L27 (see [2]),
consisting of 27 runs. The increased size of the
experiment, as we mentioned previously, was
deliberately planned to allow us to estimate,
together with the main effects, the interactions
between memory size and the other four factors.
Finally, the controlled experiment (VAX 8600) with
only four factors was run following a plan based on a
fraction of a 3^4 array consisting of just 9 runs.
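
To make the size reduction concrete, the following Python sketch builds a 9-run fraction of the 3^4 design and checks the balance property that defines an orthogonal array. The paper does not say which 9-run fraction was used, so the construction below (two free columns plus two columns formed modulo 3) is an assumed textbook construction, and the names are illustrative.

```python
from itertools import combinations, product

LEVELS = 3

# Full factorial for four three-level factors: 3**4 = 81 combinations.
full = list(product(range(LEVELS), repeat=4))

# Standard 9-run orthogonal array: two free columns a and b, plus two
# columns derived modulo 3. This is an assumed textbook construction.
l9 = [(a, b, (a + b) % LEVELS, (a + 2 * b) % LEVELS)
      for a, b in product(range(LEVELS), repeat=2)]


def is_orthogonal(runs):
    """True if every pair of columns shows each level pair equally often."""
    n_cols = len(runs[0])
    expected = len(runs) // LEVELS**2
    for i, j in combinations(range(n_cols), 2):
        counts = {}
        for run in runs:
            key = (run[i], run[j])
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != LEVELS**2 or set(counts.values()) != {expected}:
            return False
    return True


print(len(full), "runs in the full factorial")
print(len(l9), "runs in the fraction; orthogonal:", is_orthogonal(l9))
for run in l9:
    print(run)
```

Each row of the printed array is one run, and each column gives the level (0, 1 or 2) of one factor. Every level of every factor is tried equally often against every level of every other factor, which is the sense in which all factors are tested in a fair way.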

The  advantages  of  using  design  plans  like  the 
ones  above  are  : 

•  They  provide  an  average  picture  over  the  whole 
parameter  space. 

•  The  estimate  of  the  effect  of  any  of  the  factors  is 
orthogonal  with  respect  to  those  of  the  other 
factors. 
• Under certain conditions they yield information
approximately equivalent to what you would
obtain by using a much larger experiment.
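
The orthogonality property listed above can be illustrated with a small Python sketch. The response values are synthetic, generated from an invented additive model with no noise and no interactions; they are not data from the experiments. The 9-run array is the assumed textbook construction, and all names are hypothetical. Under these assumptions the level averages recover every factor's effect exactly, even though only 9 of the 81 combinations are run.

```python
from itertools import product

# Assumed 9-run orthogonal array for four three-level factors.
RUNS = [(a, b, (a + b) % 3, (a + 2 * b) % 3)
        for a, b in product(range(3), repeat=2)]

# Synthetic additive model, invented purely for illustration: a baseline
# response time plus one effect per factor level (each row sums to zero).
BASELINE = 20.0
TRUE_EFFECTS = [
    (-2.0, 0.5, 1.5),
    (1.0, 0.0, -1.0),
    (0.0, 0.0, 0.0),
    (-0.5, -0.5, 1.0),
]


def response(run):
    """Response generated by the synthetic model (no noise, no interactions)."""
    return BASELINE + sum(TRUE_EFFECTS[f][lvl] for f, lvl in enumerate(run))


def main_effects(runs, ys):
    """Level mean minus grand mean, for every factor and level."""
    grand = sum(ys) / len(ys)
    effects = []
    for f in range(len(runs[0])):
        row = []
        for lvl in range(3):
            vals = [y for run, y in zip(runs, ys) if run[f] == lvl]
            row.append(sum(vals) / len(vals) - grand)
        effects.append(tuple(row))
    return effects


ys = [response(run) for run in RUNS]
for name, est in zip("ABCD", main_effects(RUNS, ys)):
    print(name, [round(e, 6) for e in est])
```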

There is of course a price to pay for this. These plans
achieve the reduction in size of the experiment by
deliberately confounding the main effect of the
factors with the "joint" effect or interaction of some
of the other factors. Therefore, they rely on the
main effects being dominant.
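
A small Python sketch shows what this confounding looks like. It uses an assumed 9-run fraction of the 3^4 design and a synthetic response in which the third and fourth factors have no effect at all; the only structure is a joint effect of the first two factors. The model and names are invented for illustration.

```python
from itertools import product

# Assumed 9-run fraction: columns A, B, C = A + B (mod 3), D = A + 2B (mod 3).
RUNS = [(a, b, (a + b) % 3, (a + 2 * b) % 3)
        for a, b in product(range(3), repeat=2)]


def response(run):
    """Synthetic response: factors C and D have NO effect at all.

    The only structure is a joint effect of A and B: 3 extra seconds
    whenever the levels of A and B sum to a multiple of 3.
    """
    a, b, _c, _d = run
    return 10.0 + (3.0 if (a + b) % 3 == 0 else 0.0)


def level_means(runs, ys, factor):
    """Average response at each level of one factor."""
    means = []
    for lvl in range(3):
        vals = [y for run, y in zip(runs, ys) if run[factor] == lvl]
        means.append(sum(vals) / len(vals))
    return means


ys = [response(run) for run in RUNS]
print("A:", level_means(RUNS, ys, 0))
print("C:", level_means(RUNS, ys, 2))
```

In this fraction the level averages for A are flat, while those for C differ by 3 seconds although C does nothing: the A-by-B interaction shows up as an apparent main effect of C. This is why such plans are trustworthy only when main effects dominate.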

In spite of the above we advocate a strategy based
on choosing a highly fractioned array because, at the
very least, it provides a most inexpensive starting
point from which we can always obtain very useful
information about the parameters. In particular,
these experiments lend themselves to being extended, if
necessary, to allow the estimation of higher order
effects (interactions) if the data suggest they might
be important. In practice, we have seldom had to
go past one iteration in this cycle. In most cases the
information provided by the data is such that the
improvement achieved by the predicted optimal
setting (as verified by a confirmatory run) makes the
extra effort involved in conducting additional
experiments unattractive.

5. DATA ANALYSIS.

Due to space constraints we cannot present
summaries and analyses of the data for each of the
three experiments. We can, however, show a selected
subset of plots summarizing the information provided
by the data with respect to how changes in the
factors affect performance. We will also show plots
to illustrate what was achieved by re-setting the
parameters to the levels suggested by the
experimental data.

Figure  3  shows  the  estimated  effects  that  each  of 
the  four  parameters  in  Experiment  number  3  (VAX 
8600)  had  on  c-compile  time.  The  performance 
measure  in  this  case  was  mean  response  time. 
Another  sensible  choice  would  be  mean  square 
average.  This  latter  criterion  has  the  advantage  of 
picking  the  setting  that  minimizes  a  sum  of  the 
variability  and  the  square  of  the  average.  The  time 
scale  is  seconds. 
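
The remark about the mean square criterion rests on the identity: mean of the squared responses = variance + (mean)^2. A short Python sketch, using hypothetical response times that are not taken from the experiments, shows how the criterion separates two settings that share the same mean.

```python
from math import isclose
from statistics import fmean, pvariance

# Hypothetical response times in seconds for two settings (made-up numbers).
setting_a = [4.0, 4.2, 3.9, 4.1, 4.3]
setting_b = [3.0, 6.5, 2.5, 5.5, 3.0]


def mean_square(times):
    """Average of the squared response times."""
    return fmean([t * t for t in times])


for name, times in (("A", setting_a), ("B", setting_b)):
    ms = mean_square(times)
    # Check the decomposition: mean square = population variance + mean^2.
    decomposition = pvariance(times) + fmean(times) ** 2
    print(name, round(fmean(times), 3), round(ms, 3), isclose(ms, decomposition))
```

Both hypothetical settings have the same mean response time, so the mean alone cannot separate them. The mean square criterion prefers setting A because its responses are less variable.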
A summary of the conclusions extracted from
these data is:

• The machine should be run at its current level of
8 Megabytes; the gradient information in Figure 3
does not suggest additional gains if another 4
Megabytes are added (memory can be bought in
4 Megabyte units).

• The number of system buffers can be set to low,
which means about 1.4 Megabytes of memory for
the system and the rest for the users.

• File system organization C is advantageous over
either A or B.

• No PDQ's should be used.

These conclusions are valid only for this
particular performance measure and for this system.
There is no reason why they have to hold for
another response measure (like edit time) even
under the same load. For instance, one could argue
that not adding PDQ's to the system could hurt text
processing performance and this could very well be
the case. In general the answer depends on the
relative capability of the processor to handle
the load. As a matter of fact, to our surprise we have
seen cases in which adding PDQ's hurts "troff"
response time. An explanation for this puzzling event
can be found in the fact that the VAX processor is
several times more powerful than the microprocessor
which drives the PDQ's. Therefore every time a text
processing job is sent to the PDQ when the CPU
could have handled it, a loss in performance is
sure to occur. Indeed, only in the first experiment
did the results suggest the inclusion of one or more
PDQ's. For this system the load was so heavy that
there was little change in troff performance from
adding or excluding the PDQ's, while there was a
positive effect on edit and trivial response times
upon adding them.


A way to check the gains in performance after
tuning is to run back-to-back confirmatory runs
under the old and new settings. The results for one
of the three systems are given in Figure 4. In the
graphs the dotted curve represents the response time
before the experiment and the solid curve represents
the response after the tuning. We see that both
trivial and response times were reduced considerably
after tuning the system.

6.  CONCLUSIONS 

We were able to reduce response time for several
typical tasks in all three systems. For the first one,
we reduced response time by 38% on average, an important
gain in a system that was considered hopeless. In the
second system, we additionally discovered that blind
use of PDQ's could lead to a loss in performance for
text processing jobs. The evaluation of the results
for the third system is still under way. Early data
seem to indicate that the configuration
recommended by the simulated load experiment
enables the machine to handle the real load rather
easily.

We  also  have  a  much  better  assessment  of  the 
real  usefulness  of  these  experiments  for  both  tuning 
and benchmarking. The results of back-to-back
confirmatory runs (a week each), showing substantial
reductions in response time for both of the tuning
experiments, indicate that it is possible to use this
approach successfully for periodic system tuning.
We cannot, however, claim that it will succeed in
general. Instead we can say that as long as attention
is confined to only a few factors (therefore keeping
the duration short) and the load is relatively stable,
the method will help to run your system better. The
experience  with  simulated  loads  convinced  us  that 
the  usefulness  of  this  approach  in  benchmarking 
studies  is  even  greater.  In  fact,  we  are  currently 
using  our  approach  to  find  the  functional  relation 
between  the  parameters  of  the  UNIX  System  V 
Virtual  Memory  Management  scheme  and  response 
time. 

Finally,  in  response  to  questions  about  the 
dangers  involved  in  ignoring  the  interaction  among 
some  of  the  factors  we  would  like  to  point  out  that, 
yes, there is a risk. However, even in cases where we
had  (upon  analysis  of  the  data)  second  thoughts 
about  the  absence  of  such  interactions,  the 
improvements  over  the  previous  system  configuration 
achieved  by  using  information  generated  during  the 
experiments  made  the  risk  worth  taking. 


*  UNIX  is  a  trademark  of  AT&T  Bell  Laboratories 

*  VAX  is  a  trademark  of  Digital  Equipment  Corporation 
FITTING PARAMETRIC AND SEMI-PARAMETRIC PROBIT FORMS
WITH NON-ZERO BACKGROUND

Haiganoush K. Preisler, USDA-Forest Service, Berkeley


1. Introduction: In a commonly used quantal
response bioassay model for toxicological
experiments, the binomial response probability $p$
is set to $p = \pi + (1-\pi)F(\eta)$, where $0 \le \pi < 1$ is a
parameter corresponding to the proportion of
background responses (natural mortality), $F$ is
the probit (logit or some other mathematical)
function and $\eta$ is a smooth function of
covariates. Parametric probit regression models
are obtained by substituting a known parametric
function of the covariates for $\eta$. Estimation in
the parametric probit regression models with
zero background (i.e. $\pi = 0$) can be handled
directly by the GLIM statistical package (Baker
and Nelder, 1978). Models with $\pi > 0$ require
special treatment because they do not fit within
the framework of generalized linear models of
GLIM. Hasselblad et al. (1980) describe a
fitting procedure for this model that employs
the EM-algorithm. Cox (1984) uses the derivative
free program BMDPAR and a short FORTRAN program
to obtain the estimates. Russell et al. (1977)
have written a FORTRAN program specifically to
handle the above model with $\pi > 0$. It produces
maximum likelihood estimates of the parameters
and performs tests of parallelism and equality
among treatment groups. In section 2 of this
paper I demonstrate how the GLIM package can be
modified to allow the fitting of the parametric
model with non-zero background.
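
As a concrete illustration of the background model, the response probability p = π + (1 − π)F(η) can be evaluated with a few lines of Python. This is a minimal sketch that assumes F is the standard normal distribution function (the probit case); the function name is hypothetical.

```python
from statistics import NormalDist


def response_probability(eta, background):
    """p = pi + (1 - pi) * F(eta), with F the standard normal cdf (probit)."""
    if not 0.0 <= background < 1.0:
        raise ValueError("background must satisfy 0 <= pi < 1")
    return background + (1.0 - background) * NormalDist().cdf(eta)


# With no background response the model reduces to the ordinary probit.
print(response_probability(0.0, 0.0))
# With 10% natural mortality every response probability is lifted toward 1.
print(response_probability(0.0, 0.1))
```

As η decreases without bound the response probability approaches the background proportion π rather than zero, which is what places the model outside the standard generalized linear model framework.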

An estimation procedure for the
non-parametric model, with $\eta$ an unspecified
smooth function of the covariates and $\pi$ zero, is
discussed in Hastie and Tibshirani (1985). In
sections 4 and 5 of this paper I present an
estimation procedure for the semi-parametric
case where $\eta$ is an unspecified function and $\pi$ is
an unknown parameter. The procedure utilizes the
specific ACE algorithm of Breiman and Friedman
(1985). It produces estimates of the functions
of the covariates that minimize a weighted
residual error criterion. Uses of these
procedures are illustrated by examples from
insecticide bioassay studies. The GLIM macros
employed to perform the parametric estimation
are listed in the appendix. The semi-parametric
fitting was done via ACE implemented within S
(Becker and Chambers 1984).

2. Parametric probit with background.

Consider the fitting of the model

$$E(y_{ij}) = \mu_{ij} = \pi_i + (1-\pi_i)\,F(x_{ij}'\beta)\,u_{ij} \qquad (2.1)$$

where $y_{ij}$, $i=1,\dots,I$, $j=1,\dots,J_i$, is the
proportion of responses out of a total of $n_{ij}$,
$\beta$ is a p-vector of parameters, $x_{ij}$ is a
p-vector of covariates and $u_{ij}$ is an indicator
variable that is set to zero for the background
responses and to one otherwise. If we assume the
number of responses to be binomial, then

$$\mathrm{var}(y_{ij}) = \sigma_{ij}^2 = \mu_{ij}(1-\mu_{ij})/n_{ij}$$

and $y_{ij}$ may be written as $\mu_{ij} + \sigma_{ij}\varepsilon$, with $\varepsilon$
an error variate.

The algorithm used by GLIM for a generalized linear model with a link function $\eta=g(\mu)$ is as follows: for a given current estimate $\eta_0$ of the linear predictor, regress

$$\eta_0 + (y-\mu_0)\,(d\eta/d\mu)_0$$

on the vector of covariates $x$ with weights defined by $w_0^{-1} = (d\eta/d\mu)_0^2\,\sigma_0^2$.
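The regression step just described can be sketched in a few lines of Python for an ordinary probit link with no background. This is only an illustration under stated assumptions: the function names (`working_values`, `irls_probit`), the single covariate plus intercept, and the fixed number of cycles are hypothetical choices, not part of GLIM.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function F."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    """Standard normal density f, the derivative of F."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def working_values(eta0, y, n):
    """Adjusted dependent variate and weight for one observation."""
    mu0 = norm_cdf(eta0)
    deta_dmu = 1.0 / norm_pdf(eta0)        # (d eta / d mu) at the current estimate
    var0 = mu0 * (1.0 - mu0) / n           # binomial variance of a proportion
    z = eta0 + (y - mu0) * deta_dmu        # variate to be regressed on x
    w = 1.0 / (deta_dmu ** 2 * var0)       # weight: 1 / ((d eta / d mu)^2 * variance)
    return z, w

def irls_probit(x, y, n, cycles=25):
    """Fit eta = b0 + b1 * x by repeated weighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(cycles):
        pairs = [working_values(b0 + b1 * xi, yi, ni) for xi, yi, ni in zip(x, y, n)]
        zs = [p[0] for p in pairs]
        ws = [p[1] for p in pairs]
        sw = sum(ws)
        xbar = sum(w * xi for w, xi in zip(ws, x)) / sw
        zbar = sum(w * z for w, z in zip(ws, zs)) / sw
        sxx = sum(w * (xi - xbar) ** 2 for w, xi in zip(ws, x))
        sxz = sum(w * (xi - xbar) * (z - zbar) for w, xi, z in zip(ws, x, zs))
        b1 = sxz / sxx
        b0 = zbar - b1 * xbar
    return b0, b1
```

When the observed proportions equal the fitted probabilities exactly, the adjusted variate reduces to the linear predictor itself, so the true coefficients are a fixed point of the iteration.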

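In order to use the GLIM algorithms for the model in (2.1) with unknown $\pi$ the following procedure may be followed. Linearize $y_{ij}$ around a previous estimate $(\hat\pi_i,\hat\beta)$ to obtain

$$y_{ij} \approx \hat\mu_{ij} + (\pi_i-\hat\pi_i)\,[1-F(x_{ij}'\hat\beta)\,u_{ij}] + (\beta-\hat\beta)'\,(1-\hat\pi_i)\,x_{ij}\,u_{ij}\,f(x_{ij}'\hat\beta) + \sigma_{ij}e .$$

Here $f$ is the derivative of $F$. Next, rearrange the terms in the equation above and set $f^*(x'\beta) = u f(x'\beta) + (1-u)$ and $m = [1-uF(x'\beta)]/[(1-\pi)f^*(x'\beta)]$ to obtain

$$\hat\pi_i m_{ij} + \hat\beta' x_{ij}u_{ij} + \frac{y_{ij}-\hat\mu_{ij}}{(1-\hat\pi_i)\,f^*(x_{ij}'\hat\beta)} \approx \pi_i m_{ij} + \beta' x_{ij}u_{ij} + \frac{\sigma_{ij}e}{(1-\hat\pi_i)\,f^*(x_{ij}'\hat\beta)} .$$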

Thus, by defining the linear predictor with a new explanatory variable $m$, i.e. $\eta = \pi m + \beta'xu$, and setting $(d\eta/d\mu) = 1/[(1-\pi)f^*]$, the GLIM algorithms can be used for estimating the parameters $\pi$ and $\beta$. Since $m$ is a function of the fitted values, it must be recalculated at each iteration. Scallan (1982) uses a similar technique to find maximum likelihood estimates for some other models with 'parametric' link functions such as the hyperbolic curve with $E(y) = (x+\lambda)/(\alpha+\beta x)$ where $\lambda$ is the extra parameter to be estimated.
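The quantities that must be recomputed at each iteration can be written out as a small Python sketch. It follows my reading of the macros in the appendix (fitted value, the derivative term, the variable m and the derivative of the link); the function names and argument names are hypothetical, and a single common background rate is assumed.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def background_terms(lp, u, pi):
    """Per-observation quantities for the probit model with background.

    lp : original linear predictor beta'x
    u  : 0 for a control (background) observation, 1 otherwise
    pi : current estimate of the background rate
    """
    phe = norm_cdf(lp) * u                     # F(x'beta) * u
    dphe = (1 - u) + u * norm_pdf(lp)          # u * f(x'beta) + (1 - u)
    mu = pi + (1.0 - pi) * phe                 # fitted value, model (2.1)
    m = (1.0 - phe) / ((1.0 - pi) * dphe)      # new explanatory variable
    deta_dmu = 1.0 / ((1.0 - pi) * dphe)       # derivative of the modified link
    return mu, m, deta_dmu

def working_variate(y, lp, u, pi):
    """Adjusted dependent variate eta + (y - mu) * d(eta)/d(mu)."""
    mu, m, deta_dmu = background_terms(lp, u, pi)
    eta = pi * m + lp * u                      # modified linear predictor
    return eta + (y - mu) * deta_dmu
```

For a control observation (u = 0) the fitted value is simply the background rate, and both m and the link derivative equal 1/(1 - pi), so the controls carry information about the background parameter only.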

3. A toxicological example with non-zero background. Data from an insecticide bioassay experiment are analyzed in this section. The experiment consisted of treating samples of male and female larvae from a particular insect population with 2 different chemicals. A fixed amount of insecticide was applied to the surface skin (topical application) of each insect (Robertson and Kimball, 1979). A control group was treated with solvent only. Mortality was tallied after 7 days.

A number of probit models were fitted to the data using the GLIM package in conjunction with the macros in the appendix. All models included one covariate, the logarithm of the concentration of insecticide used. The most general model included separate background rates, intercepts and slopes for each sex and treatment while the simplest model included the same (common) background with the same intercept and slope for each sex and treatment (table 1). The normal quantiles of the data, corrected for background using the common estimate, are plotted in figure 1 along with the fitted model B.

Table 1

Fits of various models to the insect data

Model                                             scaled deviance    d.f.
A. Common $\pi$, slope and intercept                    118           21
B. Common $\pi$, slope, separate intercepts              23           18
C. Common $\pi$, separate slopes, intercepts             21           15
D. Separate $\pi$'s, slopes and intercepts               17           12

Figure 1. Mortality of larvae treated with insecticides and the fitted probit lines for model B. (Horizontal axis: logarithm to base ten of dose.)


4. Non-parametric probit predictor. In the non-parametric probit model $E(Y)=F(\eta)$, where $Y$ is a binary response variable, $F$ is the probit function, $\eta = \sum_{k=1}^{p} t_k(x_k)$ and $t_1,\dots,t_p$ are arbitrary smooth functions of the covariates. We would like to estimate the functions $t_k$ that minimize the weighted square residual of the probit regression

$$\sum_{i=1}^{n}\,[y_i - F(\eta_i)]^2\, w_i \tag{4.1}$$

where $w = 1/[F(\eta)(1-F(\eta))]$. This can be accomplished as follows. Given initial estimates $\eta_0$, (4.1) can be approximated by

$$\sum_{i=1}^{n}\Big[y_i - F(\eta_{i0}) - f(\eta_{i0})\sum_{k=1}^{p}\Delta_k(x_{ik})\Big]^2 w_{i0} = \sum_{i=1}^{n}\Big[\frac{y_i - F(\eta_{i0})}{f(\eta_{i0})} - \sum_{k=1}^{p}\Delta_k(x_{ik})\Big]^2 w^*_{i0} \tag{4.2}$$

where $f$ is the derivative of $F$, where $\sum_k \Delta_k = \eta-\eta_0$ is the correction needed to update the estimate of $\eta$, and where $w^* = w f^2(\eta)$.

Now the problem is reduced to calculating the $\Delta_k$'s that minimize (4.2). This may be accomplished by using the ACE (Alternating Conditional Expectation) algorithm of Breiman and Friedman (1985) with the adjusted variable $z=(y-F)/f$ as the dependent variable fitted by a linear transformation and with weights given by $wf^2$.

It is to be noted that the values of the corrections $\Delta_1,\dots,\Delta_p$ as evaluated by ACE are scaled to have mean zero and variance one. In order to obtain the updated estimates of $\eta$ the coefficients $a, b_1,\dots,b_p$ in the equation

$$\eta - \eta_0 = a + \sum_{k=1}^{p} b_k\,\Delta_k(x_k)$$

need to be calculated. This is done by regressing the adjusted variables $z$ on the scaled corrections $\Delta_1,\dots,\Delta_p$.

5. Semi-parametric probit model. This model extends the one discussed in section 4 in that the extra parameter $\pi$ for the background response is included. The mean square error that is minimized in this case is

$$\sum_{i=1}^{n}\,[y_i - \pi - (1-\pi)F(\eta_i)]^2\, w_i \tag{5.1}$$

where $y$, $F$, $\eta$ and $w$ are as defined before.

For a given set of weights, $w_{i0}$, and probit predictors, $\eta_{i0}$, minimizing (5.1) with respect to $\pi$ yields

$$\hat\pi = \frac{\sum_{i=1}^{n}[y_i - F(\eta_{i0})]\,[1-F(\eta_{i0})]\,w_{i0}}{\sum_{i=1}^{n}[1-F(\eta_{i0})]^2\,w_{i0}} \tag{5.2}$$

The next step is to minimize (5.1) with respect to $t_1(x_1),\dots,t_p(x_p)$ given $\hat\pi$. This is the same as minimizing the function

$$\sum_{i=1}^{n}\,[y_i^* - F(\eta_i)]^2\, w_i \tag{5.3}$$

where now $y_i^* = (y_i-\hat\pi)/(1-\hat\pi)$.

The function (5.3) is of the same form as (4.1), and so the same procedure can be used to obtain estimates for $t_1,\dots,t_p$ once $\pi$ is calculated using (5.2). One could then proceed by iterating back and forth between determining $\pi$ and determining $\eta$.
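The background step of this alternation is a one-line weighted least squares problem, since the residual in (5.1) is linear in the background parameter. A minimal Python sketch follows; the function names are hypothetical, the fitted probabilities F are assumed to be supplied by whatever smoother plays the role of ACE, and the closed form is the one obtained by setting the derivative of (5.1) with respect to the background parameter to zero.

```python
def update_background(y, F, w):
    """Closed-form minimizer of sum w_i * (y_i - pi - (1 - pi) * F_i)^2 over pi.

    The residual equals (y_i - F_i) - pi * (1 - F_i), so this is a weighted
    regression through the origin of (y_i - F_i) on (1 - F_i).
    """
    num = sum((yi - Fi) * (1.0 - Fi) * wi for yi, Fi, wi in zip(y, F, w))
    den = sum((1.0 - Fi) ** 2 * wi for Fi, wi in zip(F, w))
    return num / den

def adjust_response(y, pi):
    """Background-corrected response y* = (y - pi) / (1 - pi)."""
    return [(yi - pi) / (1.0 - pi) for yi in y]
```

If the responses follow the model exactly, `update_background` recovers the background rate and `adjust_response` returns the underlying probit probabilities, which is what allows the section 4 procedure to be reused on the corrected responses.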

6. A toxicological example with two covariates. Robertson et al. (1981) presented a group of experiments that tested the effects of weight on the response of the western spruce budworm (Choristoneura occidentalis Freeman) to insecticides. The data analyzed in this section are from one of these experiments wherein each insect was weighed and then treated with a fixed concentration of DDT. The response of each insect to the chemical was recorded after 7 days with the response variable $y_i=1$ if the insect was dead by the seventh day and $y_i=0$ otherwise. Plots of the binary response data versus the two covariates, dose per weight and weight, are shown in figure 2.

Figure 2. Mortality of insects, treated with DDT, versus dose per weight and weight. The area of the circles is proportional to the number of observations.

The objectives of this study were two-fold. The first was to determine whether insects respond to toxicants in proportion to their body weight. The second was to decide on a mathematical form for the predictor $\eta$ as a function of dose and weight. The plots in figure 3 are the transformed variables, $t_1(x_1)$ and $t_2(x_2)$, versus the original variables $x_1$=dose per weight and $x_2$=weight. The initial values used were obtained by fitting a parametric probit model using the GLIM package. The plots in figure 4 are the transformed variables versus the logarithms of the dose per weight and weight respectively. The nearly linear shapes of these graphs suggest that a logarithmic transformation might be appropriate for these variables. In order to determine whether both variables are needed in the model we also ran a probit regression, using GLIM, with the transformed variables as covariates. The resulting scaled deviances for the models with the nonparametric functions and the model with a parametric predictor using a logarithmic transformation are given in table 2.

Figure 3. Non-parametric transformations of covariates for DDT data.

The large and significant decrease in the deviance between models (A) and (B) is a substantial indication that the weight covariate is needed in the model. The similarity in the deviances for models (B) and (C) is an indication that the logarithmic transformation is most probably the appropriate transformation to be used for these covariates.

In conclusion, it was found that the commonly used logarithmic transformation remains appropriate when consideration is extended to a broad family of transformations.

Table 2

Comparison of parametric and non-parametric fits of the model $F(\eta(x_1,x_2))$ with $x_1$=dose per weight and $x_2$=weight.

$\eta$                                                  scaled deviance    d.f.
A. $\beta_0+\beta_1 t_1(x_1)$                                 228          257
B. $\beta_0+\beta_1 t_1(x_1)+\beta_2 t_2(x_2)$                177          256
C. $\beta_0+\beta_1\log(x_1)+\beta_2\log(x_2)$                184          256

Figure 4. Non-parametric transformations versus covariates, on logarithmic scale.

The following macros are used together with the OWN facility of GLIM to fit the parametric probit model of section 2. In these macros lp represents the original linear predictor $\beta'xu$ and %lp represents the modified linear predictor with the new explanatory variable m.

$c GLIM MACROS
$subfile probit
$mac fv
$switch %a mext
$calc %a=1 : phe=%np(lp)*u :
      dphe=(1-u)+u*(%exp(-.5*lp**2)/%sqrt(2*%pi))
$calc m(i)=(1-phe)/((1-%pe(a))*dphe)
$calc %fv(i)=%pe(a)+(1-%pe(a))*phe
$endm
$mac dr $calc %dr(i)=1/((1-%pe(a))*dphe)
$endm
$mac va $calc %va=%fv*(1-%fv)/n
$calc %va=%if(%le(%va,0),.0000001,%va)
$endm
$mac di
$calc %di=-2*n*(%yv*%log(%fv/%yv)+
      (1-%yv)*%log((1-%fv)/(1-%yv)))
$endm
$mac mext $extract %pe
$calc lp(i)=%lp-%pe(a)*m
$endm
$return
!
!
$c A GLIM SESSION TO FIT PROBIT REGRESSION
   WITH NON-ZERO BACKGROUND TO INSECTICIDE
   DATA IN SECTION 3.

$calc %n=24 ! SETS UP SAMPLE SIZE.
$units %n $data d r n tr sex ! READS IN DATA.
$dinput 7 $
$input 8 probit $ ! READS IN OWN MACROS.
$fac tr 2 sex 2
$calc ld=.4343*%log(d) ! LOGARITHM TEN OF DOSE
$calc i=%gl(%nu,1) : u=%ne(d,0)
 - invalid function/operator argument(s)
$c IGNORE THE INVALID FUNCTION ERROR MESSAGES.
$yvar r
$err b n $link p $wei u
! THIS PRODUCES INIT. ESTIMATES USING
! DATA WITH DOSE GREATER THAN ZERO.
$fit ld $
        scaled
cycle  deviance    df
  4     114.7      18
$calc lp=u*%lp : y=r/n
$yvar y
$calc a=1 $fac a 1 ! SETS UP NUMBER OF DIFFERENT
                   ! BKG. PARAMETERS TO ESTIMATE
 - current display inhibited
$calc %pe(1)=.05 ! INIT. VALUE FOR BKG.
$calc %lp=m=0 ! SETS INIT. VALUE FOR %lp AND m.
$calc b=1 $fac b 1 ! A DUMMY FACTOR NEEDED TO
                   ! HAVE CONTROL OVER %pe ORDER
                   ! IT WOULD NOT BE NEEDED WERE
                   ! 'a' NOT A FACTOR.
$own fv dr va di
$wei $scale 1
$c THE FOLLOWING FIT IS FOR MODEL B OF TABLE 1.
$fit a.m+sex.tr.u+b.ld-%gm $d e $
 - invalid function/operator argument(s)
        scaled
cycle  deviance    df
  5     23.08      18

     estimate        s.e.        parameter
 1   0.5305e-01   0.1500e-01   a(1).m
 2   1.438        0.1685       b(1).ld
 3   1.056        0.1393       sex(1).tr(1).u
 4  -0.1006       0.1123       sex(1).tr(2).u
 5   1.071        0.1401       sex(2).tr(1).u
 6   0.3486       0.1141       sex(2).tr(2).u
scale parameter taken as 1.0000
$stop

ON  THE  FITTING  AND  FORECASTING  OF  VECTOR  TIME  SERIES  MODELS 


B.L.  Shea,  Numerical  Algorithms  Group  Ltd.,  U.K. 


1. INTRODUCTION

Let $W_t = (w_{1t}, w_{2t}, \dots, w_{kt})'$ $(t = 1,2,\dots,n)$ denote a vector of k time series assumed to be jointly stationary and generated by the model

$$w_t - \phi_1 w_{t-1} - \dots - \phi_p w_{t-p} = \varepsilon_t - \theta_1\varepsilon_{t-1} - \dots - \theta_q\varepsilon_{t-q} \tag{1}$$

where $w_t = W_t-\mu$ denotes the deviation of $W_t$ from its mean $\mu$ and $\varepsilon_t = (\varepsilon_{1t},\varepsilon_{2t},\dots,\varepsilon_{kt})'$ $(t=1,2,\dots,n)$ denotes a vector of k residual series assumed to have a multivariate normal distribution with zero mean and positive definite covariance matrix $\Sigma = \sigma^2Q$. We shall also assume that $E(\varepsilon_t\varepsilon_s') = 0$ for $t \ne s$.

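(1) is called a vector autoregressive-moving average (VARMA) process of order (p,q). $\phi = (\phi_1,\phi_2,\dots,\phi_p)$ are the p k x k matrices of autoregressive parameters and $\theta = (\theta_1,\theta_2,\dots,\theta_q)$ are the q k x k matrices of moving average parameters. (1) may be written in the state space form

$$w_t = h'\alpha_t + \varepsilon_t, \qquad \alpha_t = A\alpha_{t-1} + Rw_{t-1}$$

where $\alpha_t$ is the state vector of length kr with r = maximum (p,q) and

$$A = \begin{pmatrix} \theta_1 & I & & \\ \theta_2 & & \ddots & \\ \vdots & & & I \\ \theta_r & 0 & \cdots & 0 \end{pmatrix},\qquad R = \begin{pmatrix}\phi_1-\theta_1\\ \phi_2-\theta_2\\ \vdots \\ \phi_r-\theta_r\end{pmatrix},\qquad h = \begin{pmatrix} I\\ 0\\ \vdots\\ 0\end{pmatrix}.$$

(Note that $\phi_i = 0$ for $i>p$ and $\theta_j=0$ for $j>q$.) Let $a_{t|t-1}$ denote the linear minimum mean square error (MMSE) estimate of $\alpha_t$ given data up to time t-1 and $\sigma^2P_{t|t-1}$ the covariance matrix of this estimation error. Similarly let $a_t$ denote the MMSE estimate of $\alpha_t$ given data up to time t and $\sigma^2P_t$ the covariance matrix of this estimation error. Then it can be shown that the log likelihood function is given by the expression

$$\log L(\phi,\theta,\mu,\sigma^2,Q) = -\frac{nk}{2}\log(2\pi\sigma^2) - \frac{1}{2}\sum_{t=1}^{n}\log|F_t| - \frac{1}{2\sigma^2}\sum_{t=1}^{n}V_t'F_t^{-1}V_t \tag{2}$$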

For t=1,2,...,n the one step ahead prediction errors $V_t$ and the corresponding covariance matrices $\sigma^2F_t = E(V_tV_t')$ are generated by the recursive equations

$$\begin{aligned}
a_{t|t-1} &= Aa_{t-1} + Rw_{t-1}\\
V_t &= w_t - h'a_{t|t-1}\\
P_{t|t-1} &= AP_{t-1}A'\\
F_t &= h'P_{t|t-1}h + Q\\
a_t &= a_{t|t-1} + P_{t|t-1}hF_t^{-1}V_t\\
P_t &= P_{t|t-1} - P_{t|t-1}hF_t^{-1}h'P_{t|t-1}
\end{aligned}$$

Starting values are given by setting $a_{1|0} = 0$ and calculating $P_{1|0}$ using the method described by Jones (1980). Ansley and Kohn (1983) discuss a similar procedure for a different state space representation which is less efficient whenever $p \le q$. If we have no missing observations then the following recursive equations should be used (see Shea (1986))

$$\begin{aligned}
V_t &= w_t - h'a_{t|t-1}\\
a_{t|t-1} &= Ta_{t-1|t-2} + K_{t-1}F_{t-1}^{-1}V_{t-1}\\
K_t &= K_{t-1} + TL_{t-1}M_{t-1}(h'L_{t-1})'\\
F_t &= F_{t-1} + h'L_{t-1}M_{t-1}(h'L_{t-1})'\\
M_t &= M_{t-1} - M_{t-1}(h'L_{t-1})'F_t^{-1}h'L_{t-1}M_{t-1}\\
L_t &= TL_{t-1} - K_{t-1}F_{t-1}^{-1}h'L_{t-1}
\end{aligned}$$

where $T = A+Rh'$ is just the matrix A with $\theta$'s replaced by $\phi$'s. Initial conditions are given by $a_{1|0} = 0$, $K_1 = L_1 = TP_{1|0}h + RQ$, $F_1 = h'P_{1|0}h+Q$, $M_1 = -F_1^{-1}$.

$M_t$ (k x k) and $L_t$ (kr x k), unlike $\sigma^2F_t$ and the well known Kalman gain matrix $\sigma^2K_t = E(\alpha_{t+1}V_t')$, have no physical interpretation as covariance matrices.
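The two sets of recursions can be compared directly in the scalar case k = 1, p = q = 1, where every matrix is a number. The Python sketch below is an illustration under stated assumptions, not the author's program: the function names are hypothetical, Q is taken as 1, the MA sign convention is $w_t = \phi w_{t-1} + \varepsilon_t - \theta\varepsilon_{t-1}$, and the starting variance is the stationary value $(\phi-\theta)^2/(1-\phi^2)$ rather than the output of Jones's method.

```python
def kalman_errors(w, phi, theta):
    """Prediction errors V_t and factors F_t from the general recursions."""
    A, R, Q = theta, phi - theta, 1.0
    a_pred = 0.0                               # a_{1|0}
    P_pred = R * R * Q / (1.0 - phi * phi)     # stationary P_{1|0}
    V, F = [], []
    for wt in w:
        v = wt - a_pred                        # V_t = w_t - h'a_{t|t-1}, h = 1
        f = P_pred + Q                         # F_t = h'P_{t|t-1}h + Q
        V.append(v)
        F.append(f)
        a_filt = a_pred + P_pred / f * v       # a_t
        P_filt = P_pred - P_pred * P_pred / f  # P_t
        a_pred = A * a_filt + R * wt           # a_{t+1|t}
        P_pred = A * P_filt * A                # P_{t+1|t}
    return V, F

def fast_errors(w, phi, theta):
    """The same quantities from the recursions for complete data."""
    T, R, Q = phi, phi - theta, 1.0
    P10 = R * R * Q / (1.0 - phi * phi)
    a_pred = 0.0
    F = P10 + Q
    K = L = T * P10 + R * Q
    M = -1.0 / F
    V_out, F_out = [], []
    for wt in w:
        v = wt - a_pred
        V_out.append(v)
        F_out.append(F)
        a_pred = T * a_pred + K / F * v
        F_next = F + L * M * L
        K_next = K + T * L * M * L
        L_next = T * L - K / F * L
        M = M - M * L * (1.0 / F_next) * L * M
        K, L, F = K_next, L_next, F_next
    return V_out, F_out
```

For a pure autoregression (theta = 0) both routines give $F_1 = 1/(1-\phi^2)$ and $F_t = 1$ thereafter, with $V_t = w_t - \phi w_{t-1}$ for t > 1, which is a convenient check on an implementation.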

For a stationary (and invertible) process $E((V_t - \varepsilon_t)(V_t - \varepsilon_t)')$ tends to zero as t becomes large. Thus the $V_t$'s are the linear MMSE estimates of the residual series.

$\sigma^2$ (which is typically taken to be the top left hand element of $\Sigma$ so that Q(1,1) = 1) can be differentiated out of (2) to yield a concentrated likelihood function which can be re-arranged as a sum of squares. Thus maximising (2) can be shown to be equivalent to minimising

$$\Big(\prod_{t=1}^{n}|F_t|\Big)^{1/nk}\ \sum_{t=1}^{n}V_t'F_t^{-1}V_t \tag{3}$$


To avoid problems of underflow or overflow in calculating $\prod|F_t|$ the product can be stored in the form $a\,2^{b}$ (Martin and Wilkinson (1965)). It follows that if the model (1) is stationary then the $F_t$'s will be positive definite and thus in calculating $F_t^{-1}$ the Choleski factorization of $F_t$ ($C_tC_t'$ with $C_t$ lower triangular) will be obtained as a by-product of the inversion process. Thus we have

$$\Big(\prod_{t=1}^{n}|F_t|\Big)^{1/nk} = D^2(n,k)$$

$$\text{where}\quad D(n,k) = \Big(\prod_{t=1}^{n}\prod_{i=1}^{k}C_t(i,i)\Big)^{1/nk}.$$

If we let $v_t^* = D(n,k)\,C_t^{-1}V_t$ and $e_{(t-1)k+j}$ denote the jth component of $v_t^*$ (t=1,2,...,n, j=1,2,...,k) then (3) reduces to

$$\sum_{i=1}^{nk} e_i^2 .$$

Computation of $C_t^{-1}V_t$ is speedily carried out using back substitution. A non-linear least squares algorithm such as that of Marquardt (1963) may be used to search for the maximum likelihood estimates of $\phi$, $\theta$, Q and $\mu$. Using such an algorithm has an advantage over just using a general purpose optimization routine in that least squares algorithms are numerically more stable and generally converge to the minimum more quickly. Another advantage is that a reliable estimate of the Jacobian matrix for calculating asymptotic standard errors of parameter estimates is usually obtained as a by-product.
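The reduction of (3) to a plain sum of squares is easy to check numerically in the univariate case, where $C_t$ is just the square root of $F_t$. The sketch below is hypothetical helper code for k = 1; it accumulates logarithms to avoid overflow, which is a substitute for (not an implementation of) the scaled-product storage of Martin and Wilkinson mentioned above.

```python
import math

def concentrated_objective(V, F):
    """Expression (3) for k = 1: (prod F_t)^(1/n) * sum V_t^2 / F_t."""
    n = len(V)
    log_geo_mean = sum(math.log(f) for f in F) / n   # log of (prod F_t)^(1/n)
    return math.exp(log_geo_mean) * sum(v * v / f for v, f in zip(V, F))

def scaled_residuals(V, F):
    """Residuals e_t = D * V_t / C_t whose sum of squares equals (3)."""
    n = len(V)
    # D = (prod C_t)^(1/n) with C_t = sqrt(F_t) when k = 1
    D = math.exp(sum(0.5 * math.log(f) for f in F) / n)
    return [D * v / math.sqrt(f) for v, f in zip(V, F)]
```

The list returned by `scaled_residuals` is the kind of residual vector a Marquardt-type least squares routine expects, so the likelihood search can be driven by residuals rather than by the scalar objective.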


2.  FINITE  SAMPLE  PREDICTION 

MMSE forecasts of future series values are easily computed as follows. Let us assume we wish to forecast from time origin n and let $a_{n+l|n} = E(\alpha_{n+l}|W_1,W_2,\dots,W_n)$ and $\sigma^2P_{n+l|n} = E((\alpha_{n+l}-a_{n+l|n})(\alpha_{n+l}-a_{n+l|n})')$. If $\hat w_n(l)$ denotes the linear MMSE estimate of $w_{n+l}$ given $W_1,W_2,\dots,W_n$ then $\hat w_n(l) = E(w_{n+l}|W_1,W_2,\dots,W_n) = h'a_{n+l|n}$. It follows that $\sigma^2F_{n+l} = \sigma^2(h'P_{n+l|n}h + Q)$, called the mean square error of prediction matrix, $= E\{(w_{n+l}-\hat w_n(l))(w_{n+l}-\hat w_n(l))'\}$. Since $\alpha_{n+l} = T\alpha_{n+l-1} + R\varepsilon_{n+l-1}$ and $w_{n+l} = h'\alpha_{n+l} + \varepsilon_{n+l}$, we have on taking expectations conditional on $W_1,W_2,\dots,W_n$

$$a_{n+l|n} = Ta_{n+l-1|n} + R\,E(\varepsilon_{n+l-1}|W_1,W_2,\dots,W_n) = \begin{cases} Aa_{n|n} + Rw_n, & l = 1\\ Ta_{n+l-1|n}, & l \ge 2 \end{cases}$$

We also have

$$P_{n+l|n} = \begin{cases} AP_{n|n}A', & l = 1\\ TP_{n+l-1|n}T' + RQR', & l \ge 2 \end{cases}$$

$a_{n|n}$ and $P_{n|n}$ are easily recovered from equations

$$a_{n|n} = a_{n|n-1} + P_{n|n-1}hF_n^{-1}V_n$$

$$P_{n|n} = P_{n|n-1} - P_{n|n-1}hF_n^{-1}h'P_{n|n-1}$$

$$\text{with}\quad P_{n|n-1} = P_{1|0} + \sum_{j=1}^{n-1}L_jM_jL_j' .$$

Probability limits for forecasts are calculated as follows. Let $V_{n+l} = w_{n+l} - h'a_{n+l|n}$, then $V_{n+l} \sim N(0,\sigma^2F_{n+l})$. Now $\frac{1}{\sigma^2}\sum_{t=1}^{n}V_t'F_t^{-1}V_t$ has a $\chi^2$ distribution (on nk degrees of freedom) independent of $V_{n+l}$. Suppose interest centres on the jth time series. Let $V^*_{n+l}$ denote the vector $V_{n+l}$ where the jth and kth components of $V_{n+l}$ have been interchanged. Also let $F^*_{n+l}$ denote the matrix $F_{n+l}$ with the jth and kth row and column interchanged. If $C^{*\prime}_{n+l}C^*_{n+l}$ is the Choleski decomposition of $F^*_{n+l}$ with $C^*_{n+l}$ lower triangular then

$$\frac{1}{\sigma}\,(C^{*\prime}_{n+l})^{-1}V^*_{n+l} \sim N(0,I).$$

Let $d_{n+l}$ be the (k,k)th element of $C^*_{n+l}$, then

$$d_{n+l}^{-1}\,\big(W^{(j)}_{n+l} - \hat W^{(j)}_n(l)\big)/\hat\sigma \sim t_{nk}$$

so that $\hat W^{(j)}_n(l) \pm t_{\alpha/2,\,nk}\,\hat\sigma\,d_{n+l}$ are $100(1-\alpha)\%$ probability limits for $W^{(j)}_{n+l}$, where

$$\hat\sigma^2 = \frac{1}{nk}\sum_{t=1}^{n}V_t'F_t^{-1}V_t .$$
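The forecast recursions are short enough to trace by hand for a univariate ARMA(1,1) series. The Python sketch below assumes k = 1, Q = 1 and the same sign convention as model (1); the function name and arguments are hypothetical, and the filtered quantities $a_{n|n}$ and $P_{n|n}$ are taken as given.

```python
def forecast_arma11(a_nn, P_nn, w_n, phi, theta, steps):
    """Forecasts h'a_{n+l|n} and MSE factors F_{n+l} for l = 1..steps.

    a_nn, P_nn : filtered state estimate and its (scaled) variance at time n
    w_n        : last observed deviation from the mean
    """
    A, T, R, Q = theta, phi, phi - theta, 1.0
    a = A * a_nn + R * w_n          # a_{n+1|n}
    P = A * P_nn * A                # P_{n+1|n}
    out = []
    for _ in range(steps):
        out.append((a, P + Q))      # h = 1: forecast a, F = h'Ph + Q
        a = T * a                   # a_{n+l|n} = T a_{n+l-1|n}, l >= 2
        P = T * P * T + R * Q * R   # P_{n+l|n} = T P T' + R Q R'
    return out
```

For a first-order autoregression (theta = 0) the forecasts decay geometrically as $\phi^l w_n$ and the factors $F_{n+l}$ accumulate as $1, 1+\phi^2, 1+\phi^2+\phi^4, \dots$, the familiar AR(1) forecast error variances in units of $\sigma^2$.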

PREDICTION INTERVALS FOR THE GAMMA DISTRIBUTION


Wei-Kei Shiue, Southern Illinois University-Edwardsville
Lee J. Bain, University of Missouri-Rolla


ABSTRACT

Approximate prediction intervals for a single future observation or for the average of m future observations are developed for the two-parameter gamma distribution where both parameters are considered unknown. The methods are illustrated with three examples.

INTRODUCTION

The Pearson Type III or gamma distribution is a classical distribution which provides a useful model in many fields of application. Statistical methods for this distribution when both parameters are unknown have been slow to be developed, primarily because the parameters are not in the convenient location-scale form. Consider the gamma density function given by

$$g(x;\theta,\kappa) = \frac{1}{\Gamma(\kappa)\,\theta^{\kappa}}\,x^{\kappa-1}e^{-x/\theta},\qquad x > 0;\ \kappa,\ \theta > 0.$$

The mean is $\mu = E(X) = \kappa\theta$. The parameters $\kappa$ and $\theta$ are referred to as shape and scale parameters, respectively. Optimum tests for $\theta$ with $\kappa$ as an unknown nuisance parameter are derived by Engelhardt and Bain (1977), based on the conditional distribution of $\bar X$ given $\tilde X$, in which $\bar X$ and $\tilde X$ denote the arithmetic and geometric sample means respectively. Tests for $\kappa$ with $\theta$ unknown may be based on the maximum likelihood estimator (m.l.e.) $\hat\kappa$ or, equivalently, on $S = \ln(\bar X/\tilde X)$. Bain and Engelhardt (1975) provide approximate distributional results for S. Grice and Bain (1980) provide approximate tests or confidence limits for the mean when both parameters are assumed unknown, and this method is extended to the two-sample case by Shiue and Bain (1983). Some related discussion concerning tolerance limits is given by Bain, Engelhardt and Shiue (1984). The purpose of this paper is to extend these results to obtain prediction intervals for a future observation or for the average of m future observations.

Suppose $x_1,\dots,x_n$ denotes a random sample of size n from a gamma distribution, then a lower $1-\alpha$ level prediction limit for a future observation, y, is a function of the sample, say $L_y = L_y(x_1,\dots,x_n;\alpha)$, such that

$$P[L_y(X_1,\dots,X_n;\alpha) < Y] = 1-\alpha.$$

It can also be shown (in general) that

$$P[L_y(X_1,\dots,X_n;\alpha) < Y] = E[1 - G_Y(L_y)] = 1-\alpha,$$

and therefore a $1-\alpha$ level prediction interval is also a $(1-\alpha)$-expectation tolerance interval. A lower $1-\alpha$ level prediction limit for $\bar y$, the average of m future observations, would be $L_{\bar y}(x_1,\dots,x_n;\alpha)$ where

$$P[L_{\bar y}(X_1,\dots,X_n;\alpha) < \bar Y] = 1-\alpha.$$

An upper $1-\alpha$ level prediction limit is obtained by replacing $1-\alpha$ with $\alpha$,

$$U_{\bar y}(x_1,\dots,x_n;\alpha) = L_{\bar y}(x_1,\dots,x_n;1-\alpha).$$

A prediction limit for the total of m future observations, such as the total amount of rainfall, may be useful. In reliability applications a lower prediction limit for the total operating time which will be realized from a component with m - 1 spares would be meaningful, for example. It is clear that for a total $T = m\bar Y$,

$$L_T(x_1,\dots,x_n;\alpha) = m\,L_{\bar y}(x_1,\dots,x_n;\alpha).$$

APPROXIMATE PREDICTION INTERVALS

For a random sample of size n from a gamma distribution

$$2n\bar X/\theta \sim \chi^2(2n\kappa).$$

For m future observations, $y_1,\dots,y_m$,

$$2m\bar Y/\theta \sim \chi^2(2m\kappa)$$

and it follows that

$$\bar X/\bar Y \sim F(2n\kappa,\ 2m\kappa)$$

where F(a,b) denotes Snedecor's F-distribution with a and b degrees-of-freedom. Note that letting m = 1 gives the important special case of a single future observation, $\bar y = y$.

Prediction intervals could be easily computed if $\kappa$ were known since, for example,

$$P[\bar X/\bar Y < f_{1-\alpha}(2n\kappa,\ 2m\kappa)] = P[\bar Y > \bar X/f_{1-\alpha}(2n\kappa,\ 2m\kappa)] = 1-\alpha,$$

and $\bar X/f_{1-\alpha}(2n\kappa,\ 2m\kappa)$ would be a lower $1-\alpha$ level prediction limit for $\bar Y$. Following the procedure of Grice and Bain (1980), the unknown $\kappa$ is replaced by the m.l.e. $\hat\kappa$, and then the probability level actually achieved is studied. The true probability level in this case will be a function of $\kappa$, and it may differ substantially from the nominal level for small n, but it is again found that the achieved level is fairly constant over $\kappa$. Thus, it is possible to adjust the initial level to more nearly achieve the desired level when $\hat\kappa$ is used. That is, we propose a lower prediction limit of the form

$$L_{\bar y} = \bar x/f_{1-\beta}(2n\hat\kappa,\ 2m\hat\kappa)$$

where $\beta$ is adjusted to give approximately the correct $1-\alpha$ level. The good rational approximation for $\hat\kappa$ given by Greenwood and Durand (1960) is used in the study, where $M = \ln(\bar X/\tilde X)$ and

$$\hat\kappa = \frac{.5000876 + .1648852M - .0544274M^2}{M},\qquad 0 < M \le .5772,$$

$$\hat\kappa = \frac{8.898919 + 9.059950M + .9775373M^2}{M(17.79728 + 11.968477M + M^2)},\qquad .5772 < M \le 17,$$

$$\hat\kappa = 1/M,\qquad M > 17.$$
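The rational approximation is straightforward to program. The Python sketch below is illustrative only: the function names are hypothetical, the constant 8.898919 is the value commonly quoted for the Greenwood and Durand approximation, and the sample used is the set of 32 accumulated streamflow readings listed under NUMERICAL EXAMPLES.

```python
import math

def greenwood_durand(M):
    """Approximate m.l.e. of the gamma shape parameter from M = ln(mean / geometric mean)."""
    if M <= 0:
        raise ValueError("M must be positive")
    if M <= 0.5772:
        return (0.5000876 + 0.1648852 * M - 0.0544274 * M * M) / M
    if M <= 17:
        return (8.898919 + 9.059950 * M + 0.9775373 * M * M) / (
            M * (17.79728 + 11.968477 * M + M * M))
    return 1.0 / M

def shape_estimate(sample):
    """Arithmetic mean and approximate shape estimate for a positive sample."""
    n = len(sample)
    mean = sum(sample) / n
    log_geo_mean = sum(math.log(x) for x in sample) / n
    return mean, greenwood_durand(math.log(mean) - log_geo_mean)

# Streamflow readings from the first numerical example.
streamflow = [
    46.65, 29.96, 25.49, 11.85, 41.01, 23.64, 30.90, 19.51,
    9.06, 41.06, 57.04, 30.93, 15.94, 29.70, 37.51, 38.78,
    17.25, 50.80, 31.93, 15.31, 47.11, 75.24, 25.39, 14.69,
    18.54, 45.80, 39.64, 38.14, 53.93, 39.84, 14.40, 28.24,
]
```

Calling `shape_estimate(streamflow)` gives a mean of about 32.67 and a shape estimate close to 4.5, in line with the values quoted for that example.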


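Now, let

$$\alpha_1 = p_1(\kappa,\beta) = 1 - P[\bar Y < \bar X/f_{\beta}(2n\hat\kappa,\ 2m\hat\kappa)]$$

and

$$\alpha_2 = p_2(\kappa,\beta) = 1 - P[\bar X/f_{1-\beta}(2n\hat\kappa,\ 2m\hat\kappa) < \bar Y],$$

then

$$U_{\bar y}(x_1,\dots,x_n;\alpha_1) = \bar x/f_{\beta}(2n\hat\kappa,\ 2m\hat\kappa)$$

and

$$L_{\bar y}(x_1,\dots,x_n;\alpha_2) = \bar x/f_{1-\beta}(2n\hat\kappa,\ 2m\hat\kappa).$$

The values of $\alpha_1$ and $\alpha_2$ approach $\beta$ for large n, but as noted they differ from $\beta$ and depend on $\kappa$ for small n. Since the dependence on the unknown $\kappa$ is small, our approach is to determine guidelines for selecting a value of $\beta$ which will approximately yield a desired specified $\alpha$ value.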

Values of $\alpha_1$ and $\alpha_2$ were estimated by Monte Carlo simulation for several $\beta$ values over a range of $\kappa$ values, and the results are given in Table 1. The Monte Carlo values are based on 10,000 gamma variates generated using the IMSL subroutine GGAMR. Asymptotic values were derived mathematically for the $\kappa = 0$ and $\kappa = \infty$ lines in a manner similar to that followed in the papers cited earlier. In particular, the $\kappa = \infty$ values are the same as those obtained by Grice and Bain (1980). The $\kappa = 0$ values may be obtained in a manner similar to that used in Shiue and Bain (1983) except $\hat\kappa$ is now based on only a sample of size n. In this case letting $d = n/(n+m)$ and $1-d = m/(n+m)$, we have

$$\alpha_1 = p_1(0,\beta) = \begin{cases} (1-d)\{1 - \tfrac{1}{n}\ln[\beta/(1-d)]\}^{-n+1}, & \beta \le 1-d\\[4pt] 1 - d\{1 - \tfrac{1}{n}\ln[(1-\beta)/d]\}^{-n+1}, & \beta > 1-d \end{cases}$$

and

$$\alpha_2 = p_2(0,\beta) = \begin{cases} d\{1 - \tfrac{1}{n}\ln[\beta/d]\}^{-n+1}, & \beta \le d\\[4pt] 1-(1-d)\{1 - \tfrac{1}{n}\ln[(1-\beta)/(1-d)]\}^{-n+1}, & \beta > d. \end{cases}$$

We now observe from Table 1 that the actual levels are approximately constant over the entire range of possible $\kappa$ values even for moderately small sample sizes of 10 or 20. The Monte Carlo values may not be totally accurate to the three digits shown in the table, but they show the general change in the $\alpha$ values between the limiting values at $\kappa = 0$ and $\kappa = \infty$.

Thus for an initial $\beta$ value, entering Table 1 with $\kappa = \hat\kappa$ would give a close estimate of the actual $\alpha$ value associated with the prediction limit. It would ordinarily be more helpful to know what initial $\beta$ value is necessary to approximately provide a specified $\alpha$ level. As suggested in the earlier cited references, the simple procedure of inverting the infinity values is again recommended. Table 2 gives the value of $\beta$ which yields $p_1(\infty,\beta) = \alpha$ for the commonly used values of $\alpha$, and this value of $\beta$ is then used to compute the prediction limits. For n > 40 interpolation on 1/n may be followed. Note that $p_1(\infty,\beta) = p_2(\infty,\beta)$, and also that the infinity values do not depend on m, so that only a small simple table is required. This simple adjustment should be adequate for practical applications for any range of $\kappa$ values. The approximate probability levels are less accurate at very small sample sizes, but the inherent sampling variation will be relatively larger for small samples and this cause for lack of precision in the results will generally be of relatively greater importance than the small inaccuracies in the stated probability levels.

Use of Table 2, of course, gives more accurate results with problems concerning larger $\kappa$ values. For example, in reliability problems in most cases $\kappa > 1$, since $\kappa > 1$ corresponds to having an increasing failure rate with age. In other applications small $\kappa$ values may sometimes occur, and with very small $\hat\kappa$ and small n a closer approximation would be obtained by inverting the $p_i(0,\beta)$ values. These values depend on n and m, but this is not a difficulty since they can be inverted in closed form. For specified $\alpha_1$ or $\alpha_2$ we have

$$\beta_1 = (1-d)\exp\Big\{n\Big[1 - \Big(\frac{\alpha_1}{1-d}\Big)^{1/(-n+1)}\Big]\Big\},\qquad \alpha_1 \le 1-d$$

$$\beta_1 = 1 - d\exp\Big\{n\Big[1 - \Big(\frac{1-\alpha_1}{d}\Big)^{1/(-n+1)}\Big]\Big\},\qquad \alpha_1 > 1-d$$

$$\beta_2 = d\exp\Big\{n\Big[1 - \Big(\frac{\alpha_2}{d}\Big)^{1/(-n+1)}\Big]\Big\},\qquad \alpha_2 \le d$$

$$\beta_2 = 1 - (1-d)\exp\Big\{n\Big[1 - \Big(\frac{1-\alpha_2}{1-d}\Big)^{1/(-n+1)}\Big]\Big\},\qquad \alpha_2 > d$$

where d = n/(n+m).
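Because the small-shape formulas are closed form in both directions, they are convenient to check by a round trip. The Python sketch below codes the expressions as I read them from the text; the function names are hypothetical and n > 1 is assumed so that the exponent 1/(-n+1) is defined.

```python
import math

def p1_zero(beta, n, m):
    """alpha_1 = p_1(0, beta)."""
    d = n / (n + m)
    c = m / (n + m)                      # 1 - d
    if beta <= c:
        return c * (1.0 - math.log(beta / c) / n) ** (-n + 1)
    return 1.0 - d * (1.0 - math.log((1.0 - beta) / d) / n) ** (-n + 1)

def beta1_zero(alpha, n, m):
    """Inverse of p1_zero: the beta that gives a specified alpha_1."""
    d = n / (n + m)
    c = m / (n + m)
    if alpha <= c:
        return c * math.exp(n * (1.0 - (alpha / c) ** (1.0 / (-n + 1))))
    return 1.0 - d * math.exp(n * (1.0 - ((1.0 - alpha) / d) ** (1.0 / (-n + 1))))

def p2_zero(beta, n, m):
    """alpha_2 = p_2(0, beta)."""
    d = n / (n + m)
    c = m / (n + m)
    if beta <= d:
        return d * (1.0 - math.log(beta / d) / n) ** (-n + 1)
    return 1.0 - c * (1.0 - math.log((1.0 - beta) / c) / n) ** (-n + 1)

def beta2_zero(alpha, n, m):
    """Inverse of p2_zero: the beta that gives a specified alpha_2."""
    d = n / (n + m)
    c = m / (n + m)
    if alpha <= d:
        return d * math.exp(n * (1.0 - (alpha / d) ** (1.0 / (-n + 1))))
    return 1.0 - c * math.exp(n * (1.0 - ((1.0 - alpha) / c) ** (1.0 / (-n + 1))))
```

Feeding the output of `beta1_zero` back into `p1_zero` (and likewise for the second pair) should return the level that was specified, and `beta1_zero` returns its argument unchanged when the level equals m/(n+m).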

Improved results are obtained if the above values are used for problems concerning small values of $\kappa$, say if $\hat\kappa < 1$, and Table 2 is used if $\hat\kappa \ge 1$. Note that $\beta_1 = \alpha_1$ when $m/(n+m) = \alpha_1$, so little adjustment of the nominal level is needed in this general range of values for small $\kappa$. This primarily applies to the case of upper prediction limits for a single future observation.

It is clear that the outlined procedure will usually require interpolation on both degrees-of-freedom and the probability level in the F tables. This is inconvenient, but it can be carried out in a few minutes on a calculator. Also using the nearest integer degrees-of-freedom should be acceptable in most cases at least for preliminary work. It is also possible to obtain a degrees-of-freedom less than 1 if say m = 1 and $\hat\kappa < .5$, which would require numerical integration or some approximation to obtain the critical value. The reader may refer to Pearson (1968) for more accurate interpolation procedures. These inconveniences would, of course, be removed if the procedure is computerized. It may also be worth noting that by the inherent nature of the problem one will necessarily have very wide intervals for the m = 1 and small $\kappa$ case. For example if the parameters are assumed known with $\kappa = .5$ then

$$P[(\theta/2)\chi^2_{.025}(1) < Y < (\theta/2)\chi^2_{.975}(1)] = .95,$$

which gives the interval (.00050, 2.510).


NUMERICAL EXAMPLES

EXAMPLE 1

Mielke and Johnson (1974) consider a gamma
model for the following accumulated streamflow
data from a U.S. Geological Survey station in
Colorado.

46.65  29.96  25.49  11.85  41.01  23.64  30.90  19.51
 9.06  41.06  57.04  30.93  15.94  29.70  37.51  38.78
17.25  50.80  31.93  15.31  47.11  75.24  25.39  14.69
18.54  45.80  39.64  38.14  53.93  39.84  14.40  28.24

In this case $\kappa = 4.5$ and $\bar{x} = 32.67$. A 90% two-sided
prediction interval for a future accumulated streamflow reading
is given by

\[ (L_y(x_1,\ldots,x_n;\alpha_2),\ U_y(x_1,\ldots,x_n;\alpha_1))
   = (\bar{x}/f_{1-\beta}(2n\kappa, 2m\kappa),\ \bar{x}/f_{\beta}(2n\kappa, 2m\kappa)), \]

where $\alpha_1 = \alpha_2 = .05$, n = 32, m = 1, and from
Table 2, $\beta = .042$. Thus

\[ (L_y(x_1,\ldots,x_n;.05),\ U_y(x_1,\ldots,x_n;.05))
   = (32.67/f_{.958}(288, 9),\ 32.67/f_{.042}(288, 9))
   = (32.67/2.93,\ 32.67/.504)
   = (11.2,\ 64.8). \]

In reliability applications these data could
represent the times-to-failure of a certain type of
component, and one would wish to predict the
time-to-failure of a new component being placed
in service.

EXAMPLE 2

Lieblein and Zelen (1956) present n = 23
values of the endurance, in millions of
revolutions, of deep-groove ball bearings. A
gamma distribution is suggested as an appropriate
model by Bain and Engelhardt (1980). For these
data $\bar{x} = 72.22$, $\tilde{x} = 63.46$ and $\kappa = 4.025$. Suppose
a lower 90% prediction limit is desired for the
endurance of such a bearing. We have $\alpha = .10$,
n = 23, m = 1 and $\beta = .088$ from Table 2. Thus

\[ L_y(x_1,\ldots,x_n;.10) = \bar{x}/f_{1-\beta}(2n\kappa, 2m\kappa)
   = 72.22/f_{.912}(185.2, 8.05)
   = 72.22/1.77
   = 40.8. \]

Suppose a lower 90% prediction limit is
desired for the total lifetime of a bearing and
2 spares. We have m = 3,

\[ L_y(x_1,\ldots,x_n;.10) = 72.22/f_{.912}(185.2, 24.15) = 49.47, \]

and a lower prediction limit for the total
lifetime of 3 bearings is 3(49.47) = 148.4.

EXAMPLE 3

Crow (1977) considers a gamma model for hail
data measured by hail/rain separators which were
reported in Crow et al. (1976). For 17 seeded
days he obtains the estimates $\kappa = .466$ and
$\bar{x} = 13.249$. Suppose a lower 95% prediction
interval is desired for the total amount of hail
measured on five days. We have

\[ L_y(x_1,\ldots,x_n;\alpha_2) = \bar{x}/f_{1-\beta_2}(2n\kappa, 2m\kappa), \]

where $\alpha_2 = .05$, n = 17, m = 5, d = 17/22 = .773,
and $\beta_2$ = .773 exp{17[1 - [illegible]]} = .035. Thus

\[ L_y(x_1,\ldots,x_n;.05) = 13.249/f_{.965}(15.84, 4.66)
   = 13.249/6.17
   = 2.1. \]

For the 5 day total, $T = 5\bar{y}$, and a 95% lower
prediction limit for T is 5(2.1) = 10.5.
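
The arithmetic in these examples can be retraced with a few lines of code. The following Python sketch is illustrative only: the function names are hypothetical, and the F percentage points (1.77, 2.93, .504) are simply the values quoted in the examples rather than values computed here.

```python
def f_degrees_of_freedom(n, m, kappa):
    """Degrees of freedom (2*n*kappa, 2*m*kappa) used for the F percentage point."""
    return 2 * n * kappa, 2 * m * kappa

def prediction_limit(xbar, f_value):
    """Prediction limit for the mean of m future observations: xbar / f."""
    return xbar / f_value

def total_limit(mean_limit, m):
    """Prediction limit for the total of m future observations."""
    return m * mean_limit

# Example 2: ball-bearing data, lower 90% limit for a single bearing.
df_example2 = f_degrees_of_freedom(n=23, m=1, kappa=4.025)   # about (185.2, 8.05)
lower_example2 = prediction_limit(72.22, 1.77)               # about 40.8

# Example 1: streamflow data, two-sided 90% interval.
df_example1 = f_degrees_of_freedom(n=32, m=1, kappa=4.5)     # (288, 9)
interval_example1 = (prediction_limit(32.67, 2.93),
                     prediction_limit(32.67, 0.504))         # about (11.2, 64.8)

# Example 2, total lifetime of 3 bearings.
total_example2 = total_limit(49.47, 3)                       # about 148.4
```

Only the F percentage points require tables or numerical routines; the remaining steps are simple divisions and multiplications.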
Abstract

This paper describes computer-assisted survey methods that are
being planned, tested, and implemented by the Statistical Reporting
Service (SRS) of the U.S. Department of Agriculture for its 44
data collection offices across the United States. The major
activities include developing a Data Management System along with
Computer-Aided Sampling Frame Maintenance, Computer-Aided Survey
Management, and Computer-Assisted Telephone Interviewing Systems.

Introduction 

The Statistical Reporting Service (SRS) administers the United
States Department of Agriculture's program of collecting and
publishing current national and state agricultural statistics.
SRS is totally dependent upon computer technology to carry out
this service. The following paragraphs outline characteristics of
the SRS survey and estimating program that affect its use and
development of computer methodology.

Estimates for about 120 crops, 45 livestock, and 50 farm economic
items are published in about 300 national reports each year. In
addition, estimates for many of these items are also published at
the state and county level. The frequency of the reports varies by
item, from weekly to monthly, quarterly, and annually.

The dynamic agricultural markets rely upon very timely
information. Data collection periods are generally limited to
10-15 days, with the published estimates following within 2-3
weeks. To illustrate, six major surveys that included about
150,000 farm operators were conducted during the December 1 -
January 10 time period. Reports were published as early as
December 22 and all were published by February 10. The actual
release dates for all reports are announced about one year in
advance.

The need for timeliness has been met by developing
parameter-driven application programs for data edit, analysis, and
summary. The data collection and data capture activities are
distributed to 44 State Statistical Offices. However, the current
mode of data processing is batch oriented. The 44 data collection
offices communicate with the Washington, D.C. headquarters over a
leased communications network. The IBM mainframe, used for the
majority of presurvey, survey, and post-survey processing, is also
leased and is located in Orlando, Florida.

This paper describes four systems intended to improve the
timeliness and quality of the agricultural data that SRS collects
by creating an on-line, real-time processing environment: the Data
Management System, the Computer-Aided Sampling Frame Maintenance
System, the Computer-Aided Survey Management System, and the
Computer-Assisted Telephone Interviewing (CATI) System.


Data  Management  System 

The Data Management System affects all areas of SRS's work. It
includes the basic sampling frames, raw survey data, sample
estimates and their measures of error, and administrative data,
such as budgets, salaries, equipment inventories, etc. It also
includes all of the functions that SRS uses to design, implement,
collect, and publish survey data. Figure 1 illustrates the
relationship between the sets of data SRS uses to conduct its
day-to-day operations and functional activities.

The major feature of the data management system is that it will
efficiently integrate the data sets with the functional
activities. To implement this integration, the data management
system development is divided into three parts. These include the
data dictionary/data directory, i.e., the metadata for SRS; the
logical design and physical implementation of the data base
management system; and the applications programs. A data base
management system called ADABAS and a 4th generation language
called NATURAL, for application program development, are being
used. Instead of developing the data management system for all
activities at once, the project has been subdivided into subject
matter areas. These areas will be modeled to describe the data
elements required by the data users and to define the
relationships between the data elements. Table 1 lists the 16
subject matter areas. Work has begun on the budget model and the
equipment and supply model. As application programs are being
developed for these two models, modeling will begin for specialty
crops.

The remainder of this paper discusses three functional activities
associated with the data management system; viz., computer-aided
survey management, computer-aided sampling frame maintenance, and
computer-assisted telephone interviewing.

Computer-Aided  Sampling  Frame  Maintenance 

SRS maintains data on three frames: a list frame, an area frame,
and a release frame. The first two frames are used for survey
design, data collection and analysis. The release frame contains
information on nonfarmers that should receive survey results. The
list frame contains 1.8 million records. The area frame contains
over 65,000 records and the release frame contains over 50,000
names nationwide. Of course, by data collection site the number of
records may vary from less than 10,000 to more than 100,000. All
of the processes associated with frame maintenance are currently
conducted in a batch environment, with transactions being hand
coded to forms and those forms then being key entered. The
computer-aided frame maintenance activities will allow SRS to
search and display records, add new records, and maintain and
change records in an on-line environment.

Search and display will be used for on-line overlap/nonoverlap
determinations and duplicate record detection. Improved
overlap/nonoverlap determination procedures reduce a major source
of nonsampling errors in dual frame sampling (Vogel, 1975).
Improved detection of duplicate records reduces the possibility of
duplicate records existing in the list frame. This ensures that
correct probabilities of selection are used for the list frame
estimators. Duplicate records can occur because many farms and
ranches are operated as partnerships or have operation names that
can occur in the frame along with individual names.

The add record function will be used to add new records to the
frames. Examples of the use of this function include adding the
new names associated with the 20 percent rotation of the area
frame and the continual addition of new names to the list frame as
new sources of farm and ranch names are found. The maintenance and
change function includes updating the names and addresses of
farmers and ranchers as well as adding, deleting, or changing the
auxiliary information describing a farm operation, such as number
of acres, cattle and hog inventories, etc., used for sampling
purposes.
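
As a purely illustrative sketch of duplicate record detection, the Python fragment below flags list frame records that share a telephone number even though the names differ (for example, an individual and a partnership name for the same operation). The record layout, field names, and matching rule are assumptions made for this example; they are not taken from the SRS system, which is described as being built on ADABAS.

```python
import re
from collections import defaultdict

def phone_key(phone):
    """Reduce a phone number to its digits so formatting differences are ignored."""
    return re.sub(r"\D", "", phone)

def candidate_duplicates(records):
    """Group record ids that share a phone number; return groups of two or more."""
    groups = defaultdict(list)
    for rec in records:
        groups[phone_key(rec["phone"])].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical list frame records.
records = [
    {"id": 1, "name": "John Smith", "phone": "555-0101"},
    {"id": 2, "name": "Ann Lee", "phone": "555-0102"},
    {"id": 3, "name": "Smith & Sons Ranch", "phone": "(555) 0101"},
]
duplicates = candidate_duplicates(records)
```

In practice such candidate groups would be reviewed on-line by staff before any record is removed, since a shared phone number alone does not prove that two records describe the same operation.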

All of the data used in this system will be accessible on-line.
Besides the name, address, and phone number of each farm or ranch,
it will include: up to 100 items of auxiliary information for
sampling purposes; identifiers indicating the surveys in which the
unit was selected; information about the publications the sample
unit should receive; alternative names for the sampling unit; and,
finally, a "comments" section. This system is also being
implemented with ADABAS.

Computer-Aided  Survey  Management 

As described in the Introduction, SRS has a critical need to keep
track of and be able to report the status of several surveys being
conducted either simultaneously or with considerable overlap in
data collection activity. The Computer-Aided Survey Management
system will meet this need. This system will answer ad hoc
inquiries about the survey process and produce aggregate reports
describing survey status.

Information the system will produce will include reports about
the location of sample units by geographic area, useful in
interviewer assignments. Other reports will categorize sample
units by surveys, useful in grouping questionnaires that have to
be completed by the same respondent for several simultaneous
surveys. Reports that describe sample units by mode of data
collection (mail, telephone, and face-to-face) will also be
produced. And, finally, reports on survey status from presurvey
activities through data collection and data edit will be produced
to help each field office manage the survey process.

A prototype system has been implemented on PCs using dBASE III.
It will eventually be integrated into the mainframe environment
and ADABAS.

Computer-Assisted Telephone Interviewing

CATI replaces the paper and pencil questionnaire historically
used for telephone data collection. The telephone interviewers now
have the questionnaire displayed on a cathode ray tube, and
respondents' answers are keyed directly into the machine for
editing and retention. SRS has been working jointly with the
University of California at Berkeley since 1980 to test, develop,
and implement the Berkeley/USDA CATI system. 1/ SRS currently has
11 field offices operational using CATI on super micros operating
in a UNIX environment. 2/ Late this summer, four more field
offices will receive the hardware necessary for CATI operations.
The major development effort in this area for SRS will be the
testing and implementation of CATI in an MS-DOS environment using
a Local Area Network (LAN). The CATI software has been
successfully loaded on a LAN. Operational testing will begin this
summer.


Summary

The distributed processing requirements described above are
essential to operate an efficient statistical organization. To
improve the timeliness and quality of agricultural data in the
U.S., the Statistical Reporting Service is testing, developing, and
implementing new systems to process in an interactive environment.
These include Data Management, Computer-Aided Frame Maintenance,
Computer-Aided Survey Management, and Computer-Assisted Telephone
Interviewing Systems.


Table 1. - Subject Matter Areas for Data Modeling

Field Crops            Labor, et al
Specialty Crops        Area Frame
Livestock              List Frame
Dairy                  Release Frame
Poultry                Personnel
Prices Paid            Budget
Prices Received        Equipment/Supplies


Footnotes

1/ For a more complete description of the SRS CATI environment,
see Tortora (1985). For a more complete description of the CATI
environment in general, see the paper by Nicholls and Groves
(1986) in these proceedings.

2/ UNIX is a trademark of AT&T Bell Labs.
TWO MRPP RANK TESTS AND THEIR SIMULATED POWERS FOR SOME ASYMMETRIC POPULATIONS


Derrick  S.  Tracy,  University  of  Windsor 


ABSTRACT 


Two rank tests, based on the multiresponse permutation procedure,
are compared with respect to their empirical powers when sampling
from underlying Weibull and gamma populations, for different
parameters. Simulated powers of the test statistics against
various location shifts are examined under Pearson Type III and
Type VI distribution approximations. The relative gains in power
depend upon the values of the parameter.

INTRODUCTION

For most tests of hypotheses, the test statistic is derived under
several assumptions, such as normality and homogeneity of
variances. To avoid making such assumptions when analyzing
multiresponse data, Mielke, Berry and Johnson (1976) proposed an
exact permutation procedure and introduced the MRPP (multiresponse
permutation procedure) test statistic. It is optimal when
responses are made commensurate with each other. It is applicable
to data at ordinal level or higher, as encountered often in social
and biological sciences. The exact permutation procedure requires
very heavy computations; hence certain approximations are
considered.

MRPP  STATISTICS 

Let $\Omega$ be a population of observations $X_1,\ldots,X_N$.
Let K of them be classified according to some a priori
classification scheme into g mutually exclusive subgroups
$S_1,\ldots,S_g$ with $n_i$ observations in $S_i$,
$K = \sum_{i=1}^{g} n_i$, leaving $N-K = n_{g+1}$ observations in
the excess subgroup $S_{g+1}$. The MRPP test statistic $\delta$ is
a weighted average of distances between all pairs of observations
within each of the classified subgroups. Thus

\[ \delta = \sum_{i=1}^{g} C_i \xi_i , \]

where $C_i > 0$ are weights with $\sum_{i=1}^{g} C_i = 1$, and

\[ \xi_i = \binom{n_i}{2}^{-1} \sum_{I<J} \Delta_{I,J}\, S_i(X_I)\, S_i(X_J) \]

is the average of a distance measure $\Delta_{I,J}$ between $X_I$
and $X_J$, where $S_i(X_I)$ is an indicator function, taking the
value 1 if $X_I$ is in $S_i$ and 0 otherwise. When the
classification is random, each of the
$N!/\prod_{i=1}^{g+1} n_i!$ permutations is equally likely to
occur. Thus the value of $\delta$ is likely to be higher than when
the classification is done according to some a priori scheme.
Therefore an $\alpha$ level test rejects '$H_0$: Classification is
random' if $\delta < \delta_\alpha$.

Mielke, Berry, Brockwell and Williams (1981) consider special
cases of $\delta$ with $C_i = n_i/K$ and
$\Delta_{I,J} = |R(X_I) - R(X_J)|^v$, $R(X_I)$ being the rank of
$X_I$. For v = 1 and 2, they denote the test statistic by
$\delta_1$ and $\delta_2$ respectively. With $n_{g+1} = 0$, g = 2,
v = 2, $n_1 = n_2$, the MRPP test based on $\delta_2$ is
equivalent to the two-sided Wilcoxon test. For $g \ge 2$,
$C_i = (n_i - 1)/(K - g)$, N = K, $\delta_2$ is equivalent to the
Kruskal-Wallis test. But when $g \ge 2$, N = K, v = 1 and
$C_i = n_i/K$, Brockwell, Mielke and Robinson (1982) show that
$\delta_1$ has a non-normal non-invariant distribution, and its
asymptotic distribution depends on the underlying distribution of
observations.
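
To make the definition concrete, the following Python sketch computes the rank-based statistic for the case N = K (no excess subgroup) with weights $C_i = n_i/K$. It is a minimal illustration under those assumptions, not the authors' program; the function names are hypothetical, ties are given average ranks, and every subgroup is assumed to contain at least two observations.

```python
from itertools import combinations

def ranks(x):
    """Ranks 1..N of the observations, with average ranks for ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            r[order[pos]] = average
        i = j + 1
    return r

def mrpp_rank_delta(x, labels, v=1):
    """delta_v = sum_i (n_i / K) * mean of |R(X_I) - R(X_J)|**v over pairs in S_i."""
    r = ranks(x)
    K = len(x)
    groups = {}
    for rank, label in zip(r, labels):
        groups.setdefault(label, []).append(rank)
    delta = 0.0
    for members in groups.values():
        pairs = list(combinations(members, 2))
        xi = sum(abs(a - b) ** v for a, b in pairs) / len(pairs)
        delta += (len(members) / K) * xi
    return delta

x = [1.2, 0.4, 3.1, 2.2]
separated = mrpp_rank_delta(x, [0, 0, 1, 1], v=1)   # groups hold adjacent ranks
mixed = mrpp_rank_delta(x, [0, 1, 0, 1], v=1)       # groups interleave the ranks
```

In this small example the classification that separates low from high values gives the smaller value of $\delta_1$, in line with the rule of rejecting for small $\delta$.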

For v = 1, Tracy and Tajuddin (1986) study the distribution of
$\delta_1$ for large samples when g = 2 and $n_1 = n_2 = N/2$ for
several underlying symmetric populations. Here we consider
asymmetric underlying populations, taking them to be Weibull and
gamma with different parameters. We conduct an extensive
simulation study based on 10,000 samples from the underlying
populations.

Using Mielke et al. (1981) results for the first three moments of
$\delta_1$, and those of Tracy and Tajuddin (1985) for the fourth
moment, we obtain $\beta_1$ and $\beta_2$ of $\delta_1$, and the
Pearson criterion $2\beta_2 - 3\beta_1 - 6$ for various values of
N. This indicates (Tracy and Tajuddin, 1986) that for N > 34, the
Pearson Type VI distribution is a better approximation. We obtain
powers of $\delta_1$ both under Pearson Type VI and Type III
approximations, and compare with the powers of $\delta_2$ using
the Pearson Type III approximation, which is known to be its
asymptotic distribution.

THE  METHOD 

We consider 10,000 independent samples of 80 observations from
Weibull $W(\beta)$ populations, with $\beta$ = 0.5, 0.67, 0.8,
1.0, 1.5 and 2.0. The case of $\beta$ = 1.0 is the case of the
exponential population. Similarly we consider gamma $G(\beta)$
populations, with $\beta$ = 0.3, 0.5, 2.0 and 3.0. The case of
G(1.0) is again the exponential.

We shift the last 40 of the 80 observations by $k\sigma$, where k
proceeds from 0 to some appropriate value so that power curves can
be drawn. The number of rejections was counted for the choice of
$\alpha$ = 0.01, 0.05 and 0.10.
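
The sampling and shifting step might be sketched as follows. This Python fragment is only an illustration: it uses the standard library generator in place of the IMSL subroutines, assumes a unit scale parameter, and assumes that $\sigma$ denotes the standard deviation of the underlying Weibull population; the function names are hypothetical.

```python
import math
import random

def weibull_sd(beta, scale=1.0):
    """Standard deviation of a Weibull population with shape beta."""
    g1 = math.gamma(1.0 + 1.0 / beta)
    g2 = math.gamma(1.0 + 2.0 / beta)
    return scale * math.sqrt(g2 - g1 * g1)

def shifted_sample(beta, k, n=80, rng=None):
    """Draw n Weibull(beta) observations and shift the last n // 2 by k * sigma."""
    if rng is None:
        rng = random.Random()
    sigma = weibull_sd(beta)
    x = [rng.weibullvariate(1.0, beta) for _ in range(n)]  # (scale, shape)
    half = n // 2
    return x[:half] + [value + k * sigma for value in x[half:]]

# One sample from W(1.5) with the second group shifted by 0.5 sigma.
sample = shifted_sample(beta=1.5, k=0.5, rng=random.Random(1986))
```

Repeating this for 10,000 samples and counting how often the test rejects at a given $\alpha$ gives the empirical power at that shift.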

We present our results for empirical power in Table I for Weibull
and Table II for gamma underlying populations. We also present
power plots for these cases, obtained by using the cubic spline
method of interpolation. The standard error of any estimated power
is bounded by $\sqrt{(0.5)(0.5)/10000} = 0.005$. Thus any
difference in power of more than 2(0.005) = 0.01 is significant at
least at the 5% level of significance.
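
The bound follows because an estimated power is a binomial proportion whose variance p(1 - p)/n is largest at p = 0.5. A short Python check, with illustrative function names, is:

```python
import math

def power_standard_error(p, n_samples):
    """Standard error of an empirical power p estimated from n_samples samples."""
    return math.sqrt(p * (1.0 - p) / n_samples)

def standard_error_bound(n_samples):
    """Largest possible standard error, attained at p = 0.5."""
    return power_standard_error(0.5, n_samples)

bound = standard_error_bound(10000)
```

For powers near 0 or 1 the actual standard error is considerably smaller than this bound.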

The  samples  were  drawn  using  IMSL  subroutines 
for  the  respective  populations. 


TABLE I

Empirical Powers of δ1 and δ2 for Weibull Populations

Statistic    δ1     δ1     δ2       δ1     δ1     δ2       δ1     δ1     δ2       δ1     δ1     δ2
Type         VI     III    III      VI     III    III      VI     III    III      VI     III    III

β = 0.5
Shift        0.0                    0.1σ                   0.2σ                   0.3σ
α = 0.01     .0088  .0085  .0087    .7345  .7315  .4789    .9833  .9823  .8534    .9987  .9987  .9637
α = 0.05     .0448  .0444  .0480    .9343  .9332  .7141    .9990  .9990  .9501    1.000  1.000  .9912
α = 0.10     .0927  .0924  .0953    .9774  .9770  .8081    .9998  .9998  .9760    1.000  1.000  .9965

β = 0.67
Shift        0.0                    0.1σ                   0.2σ                   0.3σ
α = 0.01     .0088  .0085  .0087    .1546  .1530  .1268    .5967  .5933  .4420    .8878  .8861  .7309
α = 0.05     .0448  .0444  .0480    .3929  .3906  .3157    .8508  .8492  .6844    .9790  .9784  .8856
α = 0.10     .0927  .0924  .0953    .5571  .5560  .4325    .9345  .9339  .7818    .9945  .9944  .9368

β = 0.8
Shift        0.1σ                   0.3σ                   0.5σ                   0.7σ
α = 0.01     .0572  .0561  .0599    .6119  .6086  .4983    .9546  .9538  .8744    .9969  .9968  .9792
α = 0.05     .1821  .1805  .1783    .8496  .8478  .7339    .9932  .9930  .9624    .9997  .9997  .9964
α = 0.10     .2959  .2951  .2773    .9286  .9281  .8233    .9987  .9987  .9814    1.000  1.000  .9982

β = 1.0
Shift        0.1σ                   0.3σ                   0.5σ                   0.7σ
α = 0.01     .0255  .0252  .0301    .3057  .3034  .2877    .7734  .7715  .7079    .9690  .9682  .9332
α = 0.05     .1032  .1022  .1106    .5682  .5648  .5300    .9319  .9308  .8792    .9956  .9956  .9833
α = 0.10     .1776  .1767  .1862    .7013  .7009  .6500    .9684  .9683  .9325    .9990  .9990  .9927

β = 1.5
Shift        0.1σ                   0.2σ                   0.3σ                   0.4σ
α = 0.01     .0146  .0152  .0159    .0462  .0458  .0509    .1131  .1123  .1236    .2315  .2298  .2477
α = 0.05     .0695  .0685  .0754    .1482  .1467  .1613    .2813  .2792  .3082    .4654  .4628  .4909
α = 0.10     .1311  .1309  .1372    .2354  .2346  .2547    .4042  .4030  .4281    .5964  .5958  .6139
Shift        0.5σ                   0.6σ                   0.7σ                   0.9σ
α = 0.01     .4008  .3976  .4196    .5898  .5868  .6006    .7489  .7472  .7562    .9475  .9463  .9418
α = 0.05     .6510  .6486  .6641    .8054  .8043  .8109    .9117  .9102  .9090    .9887  .9885  .9869
α = 0.10     .7622  .7609  .7726    .8870  .8869  .8880    .9564  .9562  .9530    .9960  .9960  .9945

β = 2.0
Shift        0.1σ                   0.2σ                   0.3σ                   0.4σ
α = 0.01     .0134  .0133  .0144    .0407  .0402  .0431    .0943  .0930  .1028    .1882  .1860  .2026
α = 0.05     .0656  .0648  .0711    .1320  .1305  .1428    .2421  .2398  .2632    .4003  .3981  .4310
α = 0.10     .1248  .1245  .1311    .2128  .2125  .2274    .3539  .3530  .3844    .5258  .5250  .5602
Shift        0.5σ                   0.6σ                   0.7σ                   0.9σ
α = 0.01     .3269  .3246  .3553    .4956  .4935  .5295    .6588  .6567  .6897    .9018  .9007  .9164
α = 0.05     .5723  .5701  .6053    .7224  .7204  .7574    .8536  .8515  .8745    .9730  .9725  .9790
α = 0.10     .6822  .6817  .7168    .8238  .8232  .8463    .9145  .9144  .9307    .9883  .9883  .9909


TABLE II

Empirical Powers of δ1 and δ2 for Gamma Populations

Statistic 

51  62 

51 

52  5, 

62 

Type 

VI  III  III 

VI 

III 

III  VI  III 

III  V 

a+ 

6 

-  0.3 

Shift-*- 

0.0 

O.lo 

0.13 O 

.5666 

.8512 

.9402 

.5616 

.8488 

.9394 

.3238  .7206  .7176 

.5642  .9320  .9306 

.6812  .9752  .9750 

.4344  .91 

.6804  .98 

.7776  .99 

0.2O 

0 

.9104 

8 

.9876 

8 

.9968 

9420  .9414 


0690  .1496 
[Figure: Power of MRPP tests. Underlying distribution: W(1.5); n1 = n2 = 40.]

[Figure: Power of MRPP tests. Underlying distribution: W(2.0); n1 = n2 = 40.]

[Figure: Power of MRPP tests. Underlying distribution: G(0.3); n1 = n2 = 40.]

[Figure: Power of MRPP tests. Underlying distribution: G(0.5); n1 = n2 = 40.]

[Figure: Power of MRPP tests. Underlying distribution: G(2.0); n1 = n2 = 40.]

FIGURE 9

[Figure: Power of MRPP tests. Underlying distribution: G(3.0); n1 = n2 = 40.]


CONCLUSIONS

We observe from Tables I and II that for all underlying Weibull
and gamma populations considered, the empirical power of
$\delta_1$ under the Pearson Type VI approximation is always
greater than that under the Type III approximation. We therefore
plot the power curves of $\delta_1$ under the Type VI
approximation and those of $\delta_2$ under the Type III
approximation. These are shown by a dotted line and a solid line
respectively in Figures 1 - 10. From the tables and the plots, we
draw the following conclusions.

For low values of $\beta$, i.e., for very skewed populations, the
powers of $\delta_1$ and $\delta_2$ increase very sharply. The
powers get close to 1 by the time the location shift is
$0.3\sigma$ for W(0.5) and G(0.3). The power of $\delta_1$ is much
greater than that of $\delta_2$ for small location shifts. As
$\beta$ approaches 1, the sharpness in the increase of powers of
$\delta_1$ and $\delta_2$ reduces gradually, with $\delta_1$
always having greater power than $\delta_2$.

For $\beta = 1$ (the exponential population), the power of
$\delta_1$ is greater than that of $\delta_2$ for [...] for larger
location shifts. By the time $\beta$ is 3, $\delta_2$ has
consistently higher power than $\delta_1$.

Overall, it seems that $\delta_1$ performs better for more skewed
Weibull and gamma populations, but as the parameter increases and
the distribution tends towards symmetry, $\delta_2$ begins to
perform better than $\delta_1$.
Introduction


Iterative procedures for solving large sets of sparse linear
equations have been known for a very long time, but only since the
high-speed digital computer became available in the 1950's have
these techniques been in popular use. The need to solve discrete
analogs of partial differential equations of elliptic and
parabolic type activated much of the research directed toward
analysing and improving iterative schemes [3,12]. The least
squares solution of rectangular systems of linear equations has
received considerable attention also, but relatively little of the
research has been focused upon iterative methods. Our purpose here
is to suggest that the discretization of the variational problem
associated with the differential equation leads to a natural
formulation of a rectangular problem solved by an iterative
procedure. Finally, the process may be applied to more general
regression problems and can provide an "inner" iteration for the
nonlinear case.

Iterative Methods for Solving
Difference Equations

Our discussion starts with the solution of Laplace's equation. We
propose first to solve $\nabla^2 W = 0$ in a closed region with W
taking on prescribed values on the boundary. In particular, we
choose the "Model Problem", $\nabla^2 W = 0$ on a square with W
fixed on all four sides. Upon the square, a uniform grid of mesh
length 1.0 is imposed. The grid points in the interior are
numbered in a regular fashion; in our case, we use the left to
right, top to bottom ordering of English text. The nine interior
points shown below are the smallest number that includes all of
the properties we wish to illustrate; the numerical examples to
follow will be based upon Figure 1.


        .1      .2      .3
        .4      .5      .6        W = 0
        .7      .8      .9
  W =    1       2       1

                Figure 1


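The finite difference stencil that produces a discrete
approximation to the Laplace operator is centered over each mesh
point, thus producing a set of nine linear equations in nine
unknowns, AW = k:

\[
\begin{bmatrix}
 4 & -1 &  0 & -1 &  0 &  0 &  0 &  0 &  0 \\
-1 &  4 & -1 &  0 & -1 &  0 &  0 &  0 &  0 \\
 0 & -1 &  4 &  0 &  0 & -1 &  0 &  0 &  0 \\
-1 &  0 &  0 &  4 & -1 &  0 & -1 &  0 &  0 \\
 0 & -1 &  0 & -1 &  4 & -1 &  0 & -1 &  0 \\
 0 &  0 & -1 &  0 & -1 &  4 &  0 &  0 & -1 \\
 0 &  0 &  0 & -1 &  0 &  0 &  4 & -1 &  0 \\
 0 &  0 &  0 &  0 & -1 &  0 & -1 &  4 & -1 \\
 0 &  0 &  0 &  0 &  0 & -1 &  0 & -1 &  4
\end{bmatrix}
\begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \\ W_5 \\ W_6 \\ W_7 \\ W_8 \\ W_9 \end{bmatrix}
= k .
\]

The matrix A is banded with five diagonals, it is tridiagonal by
blocks associated with rows of the grid points, the diagonal
blocks are tridiagonal in themselves, and finally the matrix A and
each of its diagonal blocks are positive definite. The sparseness
of the matrix is evident in the 9x9 example; for a grid with many
points, our coefficient matrix A is mostly zero.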
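
The structure of A can be generated directly from the stencil. The Python sketch below is an illustration only (the function name is hypothetical); it builds the coefficient matrix for an interior grid numbered left to right, top to bottom, and checks the symmetry and band width described above.

```python
def laplace_matrix(rows=3, cols=3):
    """Coefficient matrix of the five-point stencil on a rows x cols interior grid."""
    n = rows * cols
    A = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            A[i][i] = 4
            # Couple to the four neighbours that are interior points.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    A[i][rr * cols + cc] = -1
    return A

A = laplace_matrix()
n = len(A)
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
half_bandwidth = max(abs(i - j) for i in range(n) for j in range(n) if A[i][j] != 0)
nonzeros = sum(1 for row in A for a in row if a != 0)
```

For the 3 x 3 grid the half band width equals the number of points in a grid row, and fewer than half of the 81 entries are nonzero; the proportion of zeros grows quickly with the grid size.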

The sparsity of nonzero elements of a very large matrix leads one
to use iterative methods of solution. The matrix A is split into
two, A = D - C, so that the linear equations become DW = CW + k.
If $D^{-1}$ exists, as when D is the block diagonal partition,

\[ W = D^{-1} C W + D^{-1} k . \]

The classic Jacobi methods have this form; point Jacobi has a
strictly diagonal D while one-line Jacobi uses the block diagonal
form.
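
A minimal sketch of the point Jacobi iteration for the model problem is given below in Python. The right-hand side k used here is an arbitrary illustrative choice, not a value taken from the paper, and the function names are hypothetical.

```python
def laplace_matrix(rows=3, cols=3):
    """Five-point stencil matrix, points numbered left to right, top to bottom."""
    n = rows * cols
    A = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            A[i][i] = 4.0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    A[i][rr * cols + cc] = -1.0
    return A

def point_jacobi(A, k, iterations=200):
    """Iterate W <- D^{-1} (C W + k) with D the diagonal of A and C = D - A."""
    n = len(k)
    W = [0.0] * n
    for _ in range(iterations):
        # The whole new vector is formed from the old one before it is replaced.
        W = [(k[i] - sum(A[i][j] * W[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return W

A = laplace_matrix()
k = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0]  # assumed right-hand side
W = point_jacobi(A, k)
residual = max(abs(sum(A[i][j] * W[j] for j in range(9)) - k[i]) for i in range(9))
```

Each sweep needs only the nonzero entries of A, which is the source of the storage and arithmetic savings noted in the text; a practical code would store just the five diagonals.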

Several variations of this splitting technique have been studied,
such as Gauss-Seidel, Successive Overrelaxation, and Alternating
Direction Implicit Methods. Much of the acceleration obtained by
these variations depends upon the block tridiagonal nature of the
difference equations and is not guaranteed in general. It is true,
however, that for the most part, the value of the iteration method
comes from reduced storage requirements and fewer arithmetic
operations.

If the original, AW = k, were to be solved by direct elimination
methods, one would ordinarily decompose A into triangular factors;
for symmetric A, the Choleski method would be suitable. For banded
matrices, factoring destroys sparsity between the extreme bands
but does not disturb zeroes outside of the band. If AW = k is
written as DW = CW + k, the iteration would be carried out by a
direct solution process where Choleski factoring is applied to D.
For this reason, the iteration has been called Incomplete Choleski
[7,9].

One point of view suggests that the iteration is successful
because of the dominance of D. With this in mind, the splitting
process may replace elements of R by zero when they are identified
with small elements of A. The effect is to maintain a dominant and
easily solved D (in factored form) at the expense of increased
density of nonzero elements of C. The meaning of "smallness" is a
matter of judgment [9].

Symmetric  Matrices  and  Normal  Equations 

It  is  desirable  that  the  coefficient 
matrix  be  symmetric  and  positive  def¬ 
inite.  This  property  has  important 
consequences  affecting  the  performance 
of  the  iterative  procedure  and  the 
amount  of  computer  memory  and  arith¬ 
metic  operations  required  for  its 
execution.  Two  methods  of  deriving  the 
difference  stencil  that  ensure  symmetry 
have  been  proposed.  Varga[ll]  applied 
Green’s  theorem  to  a  mesh  box  with  the 
grid  point  in  its  interior  and  mimicked 
by  differences  the  normal  derivatives 
that  occur  in  the  theorem.  When 
applied  to  each  grid  point,  a  symmetric 
system  of  equations  is  generated. 

A  macroscopic  application  of  Green’s 
theorem  was  suggested  by  Engli,  et  al, 
and  by  Forsythe  and  Wasow[2,3].  The 
original  differential  equation  is  the 
Euler  equation  associated  with  a  vari¬ 
ational  principle  and  they  suggest  that 
the  functional  be  discretized  before 
minimization.  Then  the  set  of  linear 
equations  to  be  solved  is  simply  the 
set  of  normal  equations  associated  with 
an  overdetermined  linear  system  to  be 
solved  by  least  squares.  Thus,  the 
coefficient  matrix  is  sure  to  be 
symmetric  and  at  least  semidef ini te . 

The  positive  definite  property  is 
obtained  separately  when  using  Varga’s 
derivation.

For our problem, the basic
functional is

$$J = 0.5 \iint \left[ (\partial W/\partial x)^2 + (\partial W/\partial y)^2 \right] ds.$$


When this expression is discretized on
the same grid as before, we have a
quadratic form in the variables $W_i$,
$i = 1,2,\dots,9$, whose minimum is located
by solving $\partial J/\partial W_i = 0$, $i = 1,2,\dots,9$;
that is, the normal equations of an
overdetermined linear system, $AW = k$.
Because $A = F^T F$, we are guaranteed
symmetry and semidefiniteness. In
addition, if Choleski factoring is
applied to A producing $A = R^T R$, it is a
fact that orthogonal decomposition of F
into QR, where $Q^T Q = I$, produces
exactly the same R [8].
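The agreement between the Choleski factor of $F^T F$ and the R factor of F can be checked numerically. The following is a minimal sketch, not the authors' code: the 4 x 2 matrix `F` is an arbitrary illustration, the helper names are hypothetical, and the two factors coincide only when both are normalized to have a positive diagonal.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def cholesky_upper(a):
    """Upper-triangular R with a = R^T R (a symmetric positive definite)."""
    n = len(a)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r[i][i] = math.sqrt(a[i][i] - sum(r[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            s = sum(r[k][i] * r[k][j] for k in range(i))
            r[i][j] = (a[i][j] - s) / r[i][i]
    return r

def gram_schmidt_r(f):
    """R factor of F = QR by modified Gram-Schmidt (positive diagonal)."""
    cols = [list(c) for c in zip(*f)]
    n = len(cols)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r[i][i] = math.sqrt(sum(x * x for x in cols[i]))
        q = [x / r[i][i] for x in cols[i]]
        for j in range(i + 1, n):
            r[i][j] = sum(a * b for a, b in zip(q, cols[j]))
            cols[j] = [b - r[i][j] * a for a, b in zip(q, cols[j])]
    return r

# Illustrative overdetermined matrix (an assumption, not the paper's F).
F = [[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0], [1.0, 2.0]]
A = matmul(transpose(F), F)      # normal-equation matrix F^T F
R_chol = cholesky_upper(A)       # Choleski factor of A
R_qr = gram_schmidt_r(F)         # R from orthogonalizing F directly
```

Comparing `R_chol` and `R_qr` entry by entry shows agreement to rounding error for this example.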

The derivation of a symmetric
positive definite coefficient matrix
from a variational principle is by no
means limited to our Model Problem.
Other boundary conditions applied to
all or part of the boundary, for
example a symmetry condition, will be
treated in the discretization of the
functional and symmetric coefficient
matrices will always be the result. A
more complicated functional with
concomitant complications in the
boundary conditions arises in mechanics
[10,13]. The biharmonic equation
$\nabla^4 W = 0$ is the Euler equation
associated with

$$J = \iint \left\{ 0.5\left[ \partial^2 W/\partial x^2 + \nu\, \partial^2 W/\partial y^2 \right]^2 + 0.5(1-\nu^2)\left( \partial^2 W/\partial y^2 \right)^2 + (1-\nu)\left( \partial^2 W/\partial x \partial y \right)^2 \right\} dA \qquad (0 < \nu < 1)$$

The discretization of J is again a
quadratic form that may be interpreted
as normal equations of an overdetermined
linear system.

Incomplete  Factoring  and  Incomplete 
Orthogonalization

Choleski  factoring  of  normal 
equations  and  orthogonal  factoring  of 
the  overdetermined  system  have  been 
successful  as  direct  methods  for  linear 
least  squares  problems.  Furthermore, 
an  incomplete  Choleski  factoring  avoids 
the  "fill-in"  of  nonzero  entries 
between  the  bands,  locations  which  are 
occupied  by  zeroes  in  the  unfactored 
matrix.  This  incompleteness  induces  a 
splitting  of  A  to  produce  an  iterative 
scheme.  It  is  only  natural  to  wonder 
if  an  incomplete  orthogonalization  of  F 
might  produce  a  successful  iterative 
method  to  find  the  least  squares 
solution  of  the  overdetermined  system. 

A splitting of the rectangular
matrix F = H - G produces a normal equation
matrix of the form

$$A = F^T F = (H^T H + G^T G) - (H^T G + G^T H).$$

A natural splitting of A = D - C is to let
$D = H^T H + G^T G$ and $C = H^T G + G^T H$. If D
and C have suitable properties, e.g.
form a regular splitting, then an
iteration of the form

$$D W_{m+1} = C W_m + k \qquad \text{(Eq. 1)}$$

will converge [12]. If in addition, D
is sparse and easily solved, for
example diagonal or tridiagonal, the
calculation is inexpensive.
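The iteration of (Eq. 1) with a diagonal D can be sketched as follows. This is an illustration under assumptions, not the paper's program: the 3 x 3 matrix, the right-hand side, and the function name `splitting_iteration` are all chosen here only to show the mechanics of a Jacobi-type splitting.

```python
def splitting_iteration(d_diag, c, k, w0, steps):
    """Iterate D w_{m+1} = C w_m + k, with D diagonal (passed as its diagonal)."""
    w = list(w0)
    for _ in range(steps):
        cw = [sum(cij * wj for cij, wj in zip(row, w)) for row in c]
        # Solving with a diagonal D is one division per unknown.
        w = [(cwi + ki) / di for cwi, ki, di in zip(cw, k, d_diag)]
    return w

# Assumed example: A = D - C = [[4,-1,0],[-1,4,-1],[0,-1,4]].
d_diag = [4.0, 4.0, 4.0]
C = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
k = [3.0, 2.0, 3.0]   # chosen so that A w = k has the solution w = (1, 1, 1)

w = splitting_iteration(d_diag, C, k, [0.0, 0.0, 0.0], 60)
```

Because D dominates C in this example, the iterates approach (1, 1, 1) quickly; a tridiagonal D would replace the division by a tridiagonal solve.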

Unfortunately, an iteration of the
same nature applied to the rectangular
system will not solve the problem
unless the splitting is "proper" [1]. An
easy way to reproduce (Eq. 1) in rectangular
form is to double the row
dimension. Write F = H - G and set
HW = GW + k, GW = HW - k to be solved
simultaneously. Thus the rectangular
system

$$\begin{bmatrix} H \\ G \end{bmatrix} W_{m+1} = \begin{bmatrix} G \\ H \end{bmatrix} W_m + \begin{bmatrix} k \\ -k \end{bmatrix}$$

can be solved repeatedly. The normal
equations for $W_{m+1}$ have D for the
coefficient matrix, etc. but can be
solved by applying the QR algorithm
to the stacked matrix [H; G].

The R produced in this manner is
exactly the Choleski factor of D. For
this reason we can call this an
incomplete QR method.
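As a worked check, the normal equations of the doubled system (H stacked on G, with right-hand side built from $HW = GW + k$ and $GW = HW - k$) can be written out term by term. The derivation below uses only the definitions $F = H - G$, $D = H^T H + G^T G$ and $C = H^T G + G^T H$; here k denotes the right-hand side of the rectangular system, so the constant term appears as $F^T k$.

```latex
\begin{aligned}
\begin{bmatrix} H \\ G \end{bmatrix}^{T}
\begin{bmatrix} H \\ G \end{bmatrix} W_{m+1}
  &= \begin{bmatrix} H \\ G \end{bmatrix}^{T}
     \left( \begin{bmatrix} G \\ H \end{bmatrix} W_m
          + \begin{bmatrix} k \\ -k \end{bmatrix} \right) \\
(H^{T}H + G^{T}G)\, W_{m+1}
  &= (H^{T}G + G^{T}H)\, W_m + (H^{T} - G^{T})\, k \\
D\, W_{m+1} &= C\, W_m + F^{T} k
\end{aligned}
```

The last line has the form of (Eq. 1), with the normal-equation right-hand side $F^T k$ in place of k.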


A Specific Splitting for $\nabla^2 W = 0$

On  the  9x9  grid  of  Figure  1,  the 
discrete  analog  of  the  functional  is  a 
set  of  linear  equations  FW  =  b. 


[Matrix display: the 24 x 9 coefficient matrix F, whose nonzero entries are 1 and -1, multiplying the vector W.]


This set of equations consists of 24
rows and 9 columns. Let us choose the
diagonal band from F commencing at row
16, column 1 to constitute our matrix
G, H being the remaining portion. Then
the 48 x 9 system

$$\begin{bmatrix} H \\ G \end{bmatrix} W_{m+1} = \begin{bmatrix} G \\ H \end{bmatrix} W_m + \begin{bmatrix} k \\ -k \end{bmatrix}$$

may be solved iteratively. Each update
is a least squares problem in its own
right whose normal equations are

$$(H^T H + G^T G) W_{m+1} = (H^T G + G^T H) W_m + (H^T k - G^T k).$$

This process is precisely the Jacobi
single line iteration applied to the
square set of normal equations.

When the stacked matrix [H; G] is factored by
the QR method, the upper triangular
matrix R is the Choleski factor of the
diagonal partition of the normal
equations corresponding with one line
of grid points.

Nonlinear  Least  Squares 

The solution of a nonlinear least
squares problem is often obtained by
the Gauss-Newton-Hartley method or
Grey's [4] variation of it. Each of the
functions in the overdetermined set is
linearized at some nominal guess and a
correction is obtained by solving the
linear approximation as a conventional
least squares problem. The coefficient
matrix is the Jacobian of the original
system and on occasion may be quite
sparse. An iterative solution of the
linearized system may be the most
efficient way to proceed. The
rectangular formulation as mentioned
above might well be considered for the
solution of the "inner iteration."

As an example, we might choose a
standard test problem generally
attributed to C. F. Wood [5]. It is
often presented as


Min

in which 18 of the 28 entries are zero.

The splitting of F is obtained by an
incomplete  factoring  of 

F = QR - E,

where a selected element of R is
replaced by zero. For example, we
evaluated the Jacobian at (0,0,0,0)
and carried out the decomposition. The
Q, R, and E are shown below.

We executed the inner iteration from
this starting point within the Gauss-
Hartley scheme and found the solution
(1,1,1,1) as expected.


Summary 

By relating difference equations
with normal equations through a variational
principle, we have attempted to
formulate an iterative procedure for a
sparse overdetermined system in terms
of conventional methods for solving
difference equations. In particular,
we demonstrate a Jacobi iteration in
rectangular form. For large nonlinear
problems with sparse Jacobians, such
iterative processes may be useful in
obtaining the updates in a Gauss-Newton-
Hartley type algorithm.

POLYHEDRON GRAPHS


Danny  V.  Turner,  John  W.  Seaman,  and  Dean  M.  Young 
Baylor  University 

ABSTRACT

In the next section we introduce a simple
generalization of star plots.


This paper introduces a new multivariate
graphical technique known as polyhedron graphics,
or P-graphs. The independent variables are
displayed as a polar projection in the base plane
of a three dimensional coordinate system, forming
a base polygon. A dependent variable is graphed
above the origin of the polar projection. Lines
are projected from the plotted dependent variable
to the vertices of the base polygon, forming a
polyhedron. An additional variable can be plotted
below the base plane. Other dependent variables
can be represented by moving the base polygon in
the base plane, rotating the base polygon, etc.
Results of a three-dimensional graphics implementation
are presented along with an example from
experimental design.


Figure  1.  A  typical  polygon  graph  (star  plot). 


1 .  INTRODUCTION 

In this paper we present a graphical procedure
that can be used to display multidimensional
data. A variety of such procedures already exist,
and include such things as function plots, linear
profiles, polar profiles (k-sided polygons),
Chernoff-type faces, and so forth. The common
mathematical basis that these specialized graphical
methods have is that each can be used to map a
point in k-dimensional space ($R^k$) to a subset of
two-dimensional space ($R^2$) while attempting to
preserve in some way the original information.
Thus, a set of vectors $X_1, \dots, X_n$ with each
$X_j \in R^k$ is mapped to a collection of subsets
$S_1, \dots, S_n$ with each $S_j \subset R^2$. These subsets
can then be drawn using appropriate computer graphics
devices and studied with the intent of extracting
information about the original points.

It is obvious that there is an unlimited
number of mappings like those described above.
The purpose of this article is to present a new
display technique that is essentially a simple
generalization of one of the oldest and most popular
existing methods, namely polygon graphs.


2 .  POLYGON  GRAPHS 

With roots going back to the 1950's, polygon
graphs represent one of the earliest graphical
attempts at displaying multivariate data. These
graphs and numerous variations may be found under
many names including k-sided polygons, stars,
spiderwebs, polar profiles and sunflowers. Figure 1
shows a typical star representing a point in
12-dimensional space. The values of the twelve
variables are transformed into the lengths of the
twelve equally-spaced rays emanating from the
polar origin. (The use of equally-spaced rays is
common but obviously not required.)
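The mapping from k values to the end points of k equally spaced rays can be sketched in a few lines. This is only an illustration: the linear scaling of each value to a ray length in [0, 1] and the function name `star_vertices` are assumptions, since the text does not specify how values are transformed into lengths.

```python
import math

def star_vertices(values, vmin, vmax):
    """End points of k equally spaced rays, one ray per variable.

    Each value is scaled linearly from [vmin, vmax] to a length in [0, 1]
    (an assumed transformation) and placed at angle 2*pi*i/k.
    """
    k = len(values)
    points = []
    for i, v in enumerate(values):
        length = (v - vmin) / (vmax - vmin)
        angle = 2.0 * math.pi * i / k
        points.append((length * math.cos(angle), length * math.sin(angle)))
    return points

# Four variables give rays at 0, 90, 180 and 270 degrees.
pts = star_vertices([1.0, 0.5, 1.0, 0.5], 0.0, 1.0)
```

Joining consecutive end points in order produces the polygon (star) for one observation.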

Applications  using  star  plots  are  plentiful. 
For  typical  examples  refer  to  Chambers,  Cleveland, 
Kleiner,  and  Tukey  (1983)  or  Turner  and  Hall 
(1983).  These  plots  have  many  advantages 
including  ease  of  generation  and  interpretation. 


3 .  POLYHEDRON  GRAPHS 

Building on the idea of star plots (polygon
graphs) we propose polyhedron graphs. These
displays are formed by using the values of the
variables in a k-dimensional vector to control the
positions of the vertices of a polyhedron. Figure
2 displays a prototype polyhedron graph, or P-graph
as we shall refer to them.


Figure 2. A prototype polyhedron graph
(P-graph). Notice that the variables in the $\alpha\beta$
plane produce a portion of a star plot. This
partial star is called the base polygon. It is
hidden somewhat by the shaded faces of the
polygon.


Applications that will be particularly
suitable for P-graphs will involve situations
where there is a natural decomposition of the
multiple variables into two groups. One group of
variables will be coded into the base polygon and
the other group into vertices not in the base
polygon. An example would be multiple regression
of one response variable on the k independent
variables with the k independent variables
controlling the base polygon and the response
variable controlling one vertex outside the base
polygon as in Figure 2. An alternative is to use
the value of the residual at each observation to
control the length of the ray outside the $\alpha\beta$
plane. A second dependent variable can be represented
by adding a vertex below the base polygon
as shown in Figure 3. Pairs of P-graphs with common
base polygons can be used to add more dependent
variables.

Figure 3. A P-graph with two variables represented
outside the base polygon.
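The construction described above (a base polygon in the base plane, one vertex above the polar origin, and optionally one below) can be sketched as a wireframe. The function name `p_graph`, the vertex ordering, and the convention that heights are passed as nonnegative numbers are assumptions made for this illustration.

```python
def p_graph(base_xy, above, below=None):
    """Return (vertices, edges) of a wireframe polyhedron graph.

    base_xy : list of (x, y) points of the base polygon, in order
    above   : height of the vertex plotted above the polar origin
    below   : optional depth of a second vertex below the base plane
    """
    n = len(base_xy)
    vertices = [(x, y, 0.0) for x, y in base_xy]
    edges = [(i, (i + 1) % n) for i in range(n)]      # base polygon
    vertices.append((0.0, 0.0, above))                # first response
    edges += [(n, i) for i in range(n)]               # lines to the base
    if below is not None:
        vertices.append((0.0, 0.0, -below))           # second response
        edges += [(n + 1, i) for i in range(n)]
    return vertices, edges

# Five base variables and two responses (all values are made up).
base = [(1.0, 0.0), (0.3, 0.9), (-0.8, 0.6), (-0.8, -0.6), (0.3, -0.9)]
verts, edges = p_graph(base, above=2.0, below=1.5)
```

A three-dimensional graphics routine would then draw the edges and shade the faces they bound.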


4 .  EXAMPLE 

Evolutionary operation (EVOP) is a statistics-based
methodology for process improvement in the
context of an operating full-scale process. A
detailed exposition is found in Box and Draper
(1969). Within this context consider a
$2^3$ factorial design with the three factors
temperature, concentration and pressure and the
response variable average yield.

Figure  4  shows  a  common  way  to  display  the 
multivariate data. As the production runs are
made  under  the  various  treatment  conditions  the 
values  of  average  yield  are  shown  at  the 
appropriate  vertices. 



Suppose that EVOP is to be run using a
$2^5$ factorial design involving five factors and two
response variables. The usual EVOP displays are
not adequate for this situation. However,
P-graphs provide a simple technique for displaying
the results of the EVOP runs that is easy for
the process operators to understand. A reference
P-graph is shown in Figure 5. The independent
variables are pressure, temperature, concentration
of A, concentration of B, and concentration of C;
the response variables are average yield and
average tensile strength.


Figure 5. Reference P-graph for the EVOP
example involving five independent variables and
two response variables.

As each production run in the design is
performed the corresponding P-graph would be
displayed to the process operator (and updated in
the case of replicates). The evolution of the
process could easily be tracked and directed by
observing the time series of P-graphs. A portion
of such a time series is displayed in Figure 6.

It is easy to visualize how yield and tensile
strength are changing with respect to the changes
in the levels of the independent variables. We
feel this technique would be quite valuable to a
trained process operator.


Figure 6. Selected P-graphs for the EVOP example.
$T_1 < T_2 < T_3 < T_4$ are the corresponding production
run times.

EFFICIENT ESTIMATION & TESTING FOR HETEROSCEDASTICITY WITHOUT AUXILIARY VARIABLES
H.D. Vinod (Fordham University) and Aman Ullah (University of Western Ontario)


1.  INTRODUCTION 

The available methods for estimation
and testing in the presence of
heteroscedastic errors, Vinod and Ullah
(1981), need a specific parametric form
of heteroscedasticity. Waldman (1983)
shows "algebraic equivalence" of
White's (1980) test with certain versions
of Godfrey and Breusch-Pagan
tests which rely on auxiliary variables.
In this paper we propose a new
two-step generalized least squares
(GLS) estimator which is consistent and
asymptotically efficient. In the first
step we develop J.N.K. Rao (1973) type
modified minimum norm quadratic estimator
(MINQE) of the unknown heteroscedastic
variances based on replicated
observations for the variables in the
model. The replicated observations in
our paper are created in the framework
of Vinod's (1982a,b) use of the additional
information contained in the
fact that only a certain number of
digits in the variable data series are
reliable, and beyond which there is
fuzziness. Latin squares style replications,
which are well-known in
statistical design of experiments
literature, Kendall and Stuart (1979),
are then used. For testing of
heteroscedasticity we propose appropriate
test statistics.

2.  THE  MODEL  AND  ESTIMATORS 

Consider the usual regression model

$$y = X\beta + u \qquad (2.1)$$

where y is a T x 1 vector, X is a T x p
matrix of p regressors, $\beta$ is a p x 1
vector of unknown regression coefficients,
and u is a T x 1 vector of disturbances
such that

$$Eu = 0 \text{ and } Euu' = \mathrm{Diag}(\sigma_{11}, \sigma_{22}, \dots, \sigma_{TT}) = \Sigma \qquad (2.2)$$

The usual OLS estimator of $\beta$ in
(2.1) is given by

$$b = (X'X)^{-1}X'y, \qquad (2.3)$$

which is unbiased with the variance-covariance
matrix given by

$$V(b) = (X'X)^{-1}X'\Sigma X(X'X)^{-1} \qquad (2.4)$$

The GLS estimator of $\beta$ is

$$\hat{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y \qquad (2.5)$$

with the corresponding variance-covariance
matrix

$$V(\hat{\beta}) = (X'\Sigma^{-1}X)^{-1}. \qquad (2.6)$$

It is well known that $V(b) - V(\hat{\beta})$ is
non-negative definite.
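For a single regressor plus an intercept, the comparison of (2.4) with (2.6) can be checked by hand-sized arithmetic. The sketch below is illustrative only: the data, the variances, and the function names `weighted_ls` and `ols_slope_variance` are assumptions, and the variances are treated as known, which the paper notes is rarely the case in practice.

```python
def weighted_ls(x, y, weights):
    """Solve (X'WX) b = X'Wy for X = [1, x] and W = diag(weights).

    With weights 1/sigma_tt this is the GLS estimator (2.5); the third
    return value is the slope entry of (X'WX)^{-1}, as in (2.6).
    """
    s = sum(weights)
    sx = sum(w * xi for w, xi in zip(weights, x))
    sxx = sum(w * xi * xi for w, xi in zip(weights, x))
    sy = sum(w * yi for w, yi in zip(weights, y))
    sxy = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    det = s * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (s * sxy - sx * sy) / det
    return b0, b1, s / det

def ols_slope_variance(x, sigma):
    """Slope entry of (X'X)^{-1} X' Sigma X (X'X)^{-1}, as in (2.4)."""
    xbar = sum(x) / len(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) ** 2 * s for xi, s in zip(x, sigma)) / sxx ** 2

# Made-up data: noise-free line y = 2 + 3x, variances growing with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sigma = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.0 + 3.0 * xi for xi in x]

b0, b1, var_gls = weighted_ls(x, y, [1.0 / s for s in sigma])
var_ols = ols_slope_variance(x, sigma)
```

For these numbers the GLS slope variance is smaller than the OLS slope variance, which is the scalar version of the non-negative definiteness stated above.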

In practice, $\Sigma$ is rarely, if
ever, known; hence $\hat{\beta}$ is not operational.
The published literature contains
several techniques for relating the
heteroscedastic variances to some
auxiliary variables to estimate $\Sigma$, and
then use "estimated" GLS estimators.
White (1980) proposes a consistent estimator
of $X'\Sigma X$ in the middle of the
V(b) expression to develop a test based
on OLS residuals. His procedure
involves the inefficient OLS estimator
since no rigorous estimator for $\Sigma$ is
suggested. White's method fails if $\sigma_{tt}$
may be a cosine or sine wave in a trend
variable not included as one of the
regressors. Cragg (1983) can handle
such arbitrary forms, provided they are
explicitly stated in terms of auxiliary
variables, not otherwise.

2.1 MINQE Style Estimation of $\Sigma$

To obtain efficient estimates of
$\beta$ in (2.1) using (2.5) we need to estimate
$\Sigma$. Denote by $\sigma$ a T x 1 vector of
the diagonal elements of $\Sigma$. We have

$$\hat{u} = y - Xb = Mu, \quad M = I - H, \quad \text{and} \quad H = X(X'X)^{-1}X' \qquad (2.7)$$

where $\hat{u}$ is a T x 1 vector of ordinary
least squares (OLS) residuals, and H is
the hat matrix. From (2.2) note that
$E(\hat{u}) = 0$, $E(\hat{u}\hat{u}') = M \Sigma M$.

Matching the diagonal elements we
have the algebraic result

$$E\dot{u} = Q\sigma, \qquad (2.8)$$

where $\dot{u} = \hat{u} * \hat{u}$, $Q = M * M$ are Hadamard
products, that is, obtained by replacing
each element of $\hat{u}$ and M by their
own squares. Denoting
$\eta = \dot{u} - E\dot{u}$, we have a regression
equation:

$$\dot{u} = Q\sigma + \eta. \qquad (2.9)$$

where $\sigma$ is the set of T unknown
regression coefficients. Although the
matrix M is singular, Q may be assumed
to be non-singular. Now we will note
that the OLS estimator

$$\hat{\sigma}_{OLS} = (Q'Q)^{-1}Q'\dot{u} \qquad (2.10)$$

is unbiased but is inconsistent even
when X is non-stochastic and (X'X)/T is
finite. In the case where replicated
data are available, there is no inconsistency
problem with Rao's (1970)
MINQE as noted by Horn, et al (1975)
and J.N.K. Rao (1973). This estimator
is given below.

Denote by $y_j$ a T x 1 vector of elements
$y_{tj}$, where $y_{tj}$ is the t-th observation
on the dependent variable for the j-th
replication, j = 1, 2, ..., J. Define

$$Q_J = (I - J^{-1}H) * (I - J^{-1}H) + (J-1)J^{-2}\, H * H \qquad (2.11)$$

where * denotes the Hadamard product as
before. Now the MINQE [see Horn, et al
(1975, p. 381)] is given by

$$\hat{\sigma}_{MQ} = Q_J^{-1}\, (1/J) \sum \dot{u}_j \qquad (2.12)$$

where the summation is from j = 1 to j = J,
and $\dot{u}_j = \hat{u}_j * \hat{u}_j$, $\hat{u}_j = (I - H_j)y_j$ is a
T x 1 vector of residuals from the j-th
replication, with elements $\hat{u}_{tj}$. For
J = 1, (2.12) reduces to (2.10). The
estimator in (2.12) is consistent when
J increases without bound.

A well-known problem [Horn, et al
(1975)] with the MINQE estimator $\hat{\sigma}_{MQ}$
is that it usually yields at least
some negative estimates for $\sigma$, which
should be strictly positive. J.N.K.
Rao's (1973, 1980) modification of
MINQE called average of squared residuals
(ASR) solves the negative variance
problem, and has desirable mean squared
error (MSE) properties and rather simple
computation. First, we compute for
each of the J replicates

$$\hat{u}_j = y_j - H y_j \qquad (2.13)$$

the T x 1 vector of residuals for the j-th
replicate. Now we have the ASR estimator
immediately as

$$\hat{\sigma}_{ASR} = (1/J) \sum \hat{u}_j * \hat{u}_j \qquad (2.14)$$

Since $\hat{\sigma}_{ASR}$ is based on a sum of
squares, it will obviously be positive.
Note that (2.14) is an approximation of
MINQE in (2.12). This is because $M_j \approx I$
in (2.12), by neglecting the terms of
the lower order in T.
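The ASR computation is an observation-by-observation average of squared residuals over the J replicates. A minimal sketch follows; the function name `asr_estimate` and the toy residuals are assumptions, and the residual vectors are taken as given rather than computed from a regression.

```python
def asr_estimate(residuals_by_replicate):
    """Average of squared residuals over J replicates, for each observation t.

    residuals_by_replicate : list of J lists, each holding T residuals.
    Returns a list of T variance estimates, none of them negative.
    """
    J = len(residuals_by_replicate)
    T = len(residuals_by_replicate[0])
    return [sum(rep[t] ** 2 for rep in residuals_by_replicate) / J
            for t in range(T)]

# Toy input: J = 2 replicates, T = 2 observations.
sigma_asr = asr_estimate([[1.0, -2.0], [3.0, 0.0]])
```

Each entry is a mean of squares, which is why the negative-variance problem of MINQE cannot arise here.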

2.2 Latin Square Style Replications
and Estimation of $\Sigma$

Note that the ASR estimator
requires replicated observations which
are usually unavailable in econometrics.
Any randomized scheme of
generating replications is usually
unacceptable because each run of the same
data then yields a different estimate.
Though the choice of our Latin Square
style is somewhat arbitrary, it uses
the fuzziness range of the data to
yield unique results with desirable
properties.

Following Vinod (1982a) and Vinod
(1982b) we note that the observed
values of regressors can be written as

$$|x_{ti}| - 0.5(10)^{-h_i} \le |x^{0}_{ti}| \le |x_{ti}| + 0.99(10)^{-h_i} \qquad (2.15)$$

where $h_i$ represents the number of
"significant" or "reliable" digits to
the right of the decimal point. For
example, if the observed data are 29.7
($h_i = 1$) it may have been anywhere in
the range 29.65 to 29.799 (= 29.7 + .99
x .1) and rounded to 29.7 by the usual
rounding methods. We choose J
non-stochastic numbers from this range to
construct our replications. Let $C_j$
denote a T x p matrix of corrections,
with elements $c_{tij}$ lying in the range
($-d_i$ to $d_i$) or

$$-d_i \le c_{tij} \le d_i$$

where $d_i = 0.49(10)^{-h_i}$, and where j =
1, 2, ..., J. Starting with the observed
matrix of regressors X we construct
replications $X_j = X + C_j$ of
non-stochastic regressor matrices. The
range in (2.15) is divided into J equal
components, or may be based on
quantiles of an appropriate distribution.
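The arithmetic of the fuzziness range in the 29.7 example can be reproduced directly. This is a sketch under assumptions: the function names are hypothetical, and the negative exponent on the power of ten is inferred from the worked example (0.5 x 0.1 and 0.99 x 0.1 for one reliable decimal digit).

```python
def fuzziness_range(x, h):
    """Lower and upper ends of the fuzziness range for an observed value.

    x : observed value
    h : number of reliable digits to the right of the decimal point
    """
    unit = 10.0 ** (-h)
    return abs(x) - 0.5 * unit, abs(x) + 0.99 * unit

def correction_bound(h):
    """Largest absolute correction, 0.49 units of the last reliable digit."""
    return 0.49 * 10.0 ** (-h)

low, high = fuzziness_range(29.7, 1)   # about 29.65 and 29.799
d = correction_bound(1)                # about 0.049
```

Corrections drawn from the interval from -d to d keep every replicated value inside the fuzziness range of the observed value.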

In natural sciences the measurements
involving temperature, pressure,
weight, etc. all have an identifiable
"true" value, and there is a clearcut
meaning to the word measurement error.
In econometrics, for example, for many
(aggregative) variables such as gross
national product, implicit price
deflator, unemployment rate, etc. it is
fair to say the "true" value of the
variable itself is fuzzy. All the
values in a fuzziness range are almost
equally feasible, and (market) agents
react to the "reported" values of variables
with skepticism. By contrast, a
natural substance reacts to the true
temperature, weight etc. of another
substance, and the fact that the
engineer makes an error in measuring
the temperature is of no consequence to
the physical interaction of the substances.
This is a major distinction
which must be recognized. The true
values of the variables could be
anywhere in the fuzziness range, and
the market agents also treat them as
such. If a fitted regression equation
is overly sensitive to changes within
the fuzziness range (i.e., not "smooth")
the market agents will reject
such models. The discussion of
measurement errors from natural sciences
needs to be modified in some
social science applications.

In spectral analysis of time
series the assumption that the spectrum
should be smooth is similar to Vinod's
(1982) assumption that the regression
estimates should be "smooth", or not
too sensitive in the fuzziness range.
One does not reject the spectral analysis
or kernel estimates of density
functions simply because there is a
large variety of plausible "window"
specifications. It is well known that
in agricultural experiments the Latin
Square design eliminates the "fertility
gradient" associated with the rows and
columns of agricultural plots. We provide
unique estimates for given data,
and eliminate the effect of two coordinates
associated with the specific
observation number (rows), and the
specified variable (column) used.

Since the fuzziness in the dependent
variable y is essentially similar
to regressor fuzziness, it is convenient
to augment the X matrix by
including the additional column for y.
We write $X_a = [X : y]$. The choice of J,
the number of replications to be created,
depends on several practical considerations.
The J should be larger
than p+1, enough to provide consistent
estimates, and small enough to impose a
reasonable computational burden. If J
is an integer multiple of p+1 the construction
of (p+1) x (p+1) Latin Square
specification is most convenient. We
describe the simplest case of p = 3, J = 4,
T = 8 for convenience.

The assumed fuzziness range in
(2.15) is divided into J equal parts.
For our example with J = 4, let us choose
four constants $k_a = -d_i$, $k_b = -d_i/2$,
$k_c = +d_i/2$, and $k_d = d_i$. Now the original
data of the (augmented) X matrix of
dimension 8 x 4 are modified by adding
an 8 x 4 matrix to yield the first
replication. For convenience we report
only the subscripts of k in an 8 x 4
matrix which represents the first
replication, j = 1.

$$\begin{bmatrix}
a & b & c & d \\
b & c & d & a \\
c & d & a & b \\
d & a & b & c \\
a & b & c & d \\
b & c & d & a \\
c & d & a & b \\
d & a & b & c
\end{bmatrix}$$

Note that we have used two
"standard" 4 x 4 Latin Squares to
generate the 8 x 4 matrix. If T = 9 it
is obvious that an additional row will
have (a b c d) as subscripts of k. For
any T each column above will be a string
of (a b c d) with appropriate
starting points. For the second replication
j = 2 the subscript of k in the
top left corner is b and the first row
is (b c d a). Each column is now a
string of (a b c d) with these starting
values. For the third and fourth
replication the additive constants
start with $k_c$ and $k_d$. If J = 5, we would
cycle the subscripts (a b c d e). In
general, one can devise such strings
for any J. Denoting the j-th replication
of X by $X_j$ and the j-th replication
of y by $y_j$, we can thus generate J sets
of replicated data.
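The cyclic pattern of subscripts described above can be generated with modular arithmetic. The sketch below is one way to reproduce the pattern and is not taken from the paper; the function name `latin_subscripts` and the indexing convention (rows t and columns i counted from zero, replications j counted from one) are assumptions.

```python
def latin_subscripts(T, ncols, j, letters="abcd"):
    """Subscripts of the additive constants k for replication j (1-based).

    Entry (t, i) cycles through `letters`, shifted by one position per
    row, per column, and per replication, so each column is a repeating
    string of the letters with its own starting point.
    """
    J = len(letters)
    return [[letters[(t + i + j - 1) % J] for i in range(ncols)]
            for t in range(T)]

first = latin_subscripts(8, 4, 1)    # starts (a b c d), (b c d a), ...
second = latin_subscripts(8, 4, 2)   # top-left subscript is b
five = latin_subscripts(5, 4, 1, letters="abcde")
```

Rows 1 to 4 of `first` form one standard 4 x 4 Latin Square and rows 5 to 8 repeat it, matching the 8 x 4 display for j = 1.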

Each replication $X_j$ will imply a
new well known idempotent matrix $M_j =
[I - H_j]$ with the residual vector $\hat{u}_j =
y_j - H_j y_j$, where $H_j = X_j(X_j'X_j)^{-1}X_j'$
is the hat matrix, and this is used in
(2.14) to get

$$\hat{\sigma}_{ASR} = (1/J) \sum (y_j - H_j y_j) * (y_j - H_j y_j) \qquad (2.16)$$

An alternative derivation of the proposed
estimator in (2.16) is given
below.

For the j-th replication let us
write from (2.9)

$$\dot{u}_j = Q_j\sigma + \eta_j \qquad (2.17)$$

Further, for j = 1, ..., J we can write
(2.17) in the pooled regression form as

$$\dot{u}^{*} = Q^{*}\sigma + \eta^{*} \qquad (2.18)$$

where $\dot{u}^{*} = [\dot{u}_1', \dots, \dot{u}_J']'$, $Q^{*} =
[Q_1', \dots, Q_J']'$ and $\eta^{*} = [\eta_1', \dots, \eta_J']'$.
The pooled OLS estimator of $\sigma$ in (2.18)
is then

$$\tilde{\sigma} = (Q^{*\prime}Q^{*})^{-1}Q^{*\prime}\dot{u}^{*} \qquad (2.19)$$

Now note that $M_j \approx I$, by neglecting the
terms of the lower order in T. Thus,
substituting $M_j \approx I$, we get an approximation
of (2.19) which is given in
(2.16).

3.  GLS  ESTIMATION  AND  A  TEST  FOR 
HETEROSCEDASTICITY 
3.1 "Estimated" GLS Estimation

We note that the $\hat{\sigma}_{ASR}$ in (2.16)
generates an estimate $\hat{\Sigma}$ of the diagonal
matrix of heteroscedastic variances, by
substituting the $\hat{\sigma}_{ASR,t}$ value as the t-th
term along the diagonal. Thus we do
not need a further uncertain search for
appropriate combination and/or transformation
of auxiliary variables to
estimate the heteroscedastic variances.

To obtain the GLS estimator we
proceed as follows. The J replications
of the T x p data matrix X are rearranged
such that

$$y_j = X_j\beta + u_j, \quad j = 1, \dots, J \qquad (3.1)$$

where $X_j$ is the j-th T x p matrix of J
replicated observations on p regressors,
and $u_j$ is a T x 1 disturbance vector
such that

$$Eu_j = 0 \text{ and } Eu_ju_j' = \Sigma \qquad (3.2)$$

The model in (3.1) can be written in a
more compact form as

$$y^{*} = X^{*}\beta + u^{*} \qquad (3.3)$$

where $y^{*} = [y_1', y_2', \dots, y_J']'$, $X^{*} =
[X_1', \dots, X_J']'$, and $u^{*} =
[u_1', \dots, u_J']'$ such that

$$Eu^{*} = 0 \text{ and } Eu^{*}u^{*\prime} = I_J \otimes \Sigma \qquad (3.4)$$

The estimated GLS estimator is
then given by

$$b^{*} = (X^{*\prime}(I_J \otimes \hat{\Sigma}^{-1})X^{*})^{-1}X^{*\prime}(I_J \otimes \hat{\Sigma}^{-1})y^{*} \qquad (3.5)$$

where $\hat{\Sigma}$ is formed from $\hat{\sigma}_{ASR}$, which is given in (2.16).


It follows from a theorem in
Fuller and Rao (1978, p. 1152) that the
estimator $b^{*}$ is consistent and that
$\sqrt{JT}\,(b^{*} - \beta)$ has a limiting normal
distribution with mean 0 and covariance
matrix for $J \ge 3$ given by a limit as T
increases without bound

$$\lim JT\left[ (1 + 2J^{-1} - 8J^{-2})(X^{*\prime}(I_J \otimes \Sigma^{-1})X^{*})^{-1} + 4J^{-2}\Omega \right] \qquad (3.6)$$

where

$$\Omega = (X^{*\prime}X^{*})^{-1}X^{*\prime}(I_J \otimes \Sigma)X^{*}(X^{*\prime}X^{*})^{-1} \qquad (3.7)$$

Thus the asymptotic variance-covariance
matrix of $b^{*}$ is

$$V(b^{*}) = (1 + 2J^{-1} - 8J^{-2})(X^{*\prime}(I_J \otimes \Sigma^{-1})X^{*})^{-1} + 4J^{-2}\Omega \qquad (3.8)$$

and for large J

$$V(b^{*}) \approx (X^{*\prime}(I_J \otimes \Sigma^{-1})X^{*})^{-1} \qquad (3.9)$$

One method of checking whether the
estimated GLS estimation has improved
matters is to find the eigenvalues
of the difference between the asymptotic
covariance matrix of the OLS estimator
b in (2.4), and that of $b^{*}$. If
the efficiency has been improved the
eigenvalues of the estimated difference
$V(b) - V(b^{*})$ should be positive. A cruder
assessment may be based on a comparison
of standard errors of regression
coefficients, if computer programs
for eigenvalue computation are unavailable.

3.2 A Test for Heteroscedasticity

A simple test is proposed here
for the null hypothesis of homoscedasticity

$$H_0: \sigma_{11} = \sigma_{22} = \dots = \sigma_{TT} = \sigma^2 \qquad (3.10)$$

From the normality assumption on u it
is clear that $\sum_j u_{tj}^2/\sigma^2$ is a Chi-square
random variable $\chi^2_J$ with J degrees of
freedom (df). Even under
the null hypothesis (3.10) of
homoscedasticity, our estimates of $\sigma_{tt}$
will be random variables and will not
all be identical to each other. If $\sigma^2$
is known, estimates $J\hat{\sigma}_{ASR,t}/\sigma^2 =
\sum_j \hat{u}_{tj}^2/\sigma^2$ will be random observations
from a $\chi^2_J$ parent distribution for
large T. If $\sigma^2$ is not known, then
again under the null hypothesis the
ratios $w_{ASR,t}$ defined by $J\hat{\sigma}_{ASR,t}/s^2$ can
be considered as observations from a
random variable which is approximately
Chi-square with J df; $s^2$ is an estimate
of the common variance $\sigma^2$ given by

$$s^2 = (y - Xb)'(y - Xb)/(T - p). \qquad (3.11)$$

Then the empirical distribution function
(edf) of these ratios should
resemble that of a $\chi^2_J$ variable. A
"goodness of fit" test procedure as in
Kendall and Stuart (1979, Ch. 30) is to
rearrange the above ratios in an
ascending order of magnitude:

$$w_{ASR,1} \le \dots \le w_{ASR,T}. \qquad (3.12)$$

Next we evaluate the cumulative distribution
of the $\chi^2_J$ variable, denoted by $Z_{ASR,t}$,
evaluated at the "order statistics" in
(3.12).

Now the Cramer-von Mises test statistic is

$W^2 = 1/(12T) + \sum_{t=1}^{T} [Z_{ASR,t} - (2t-1)/(2T)]^2$,   (3.13)

where the summation is from t=1 to t=T.

The 5% point for this statistic is based
on an approximation, and for large samples
is 0.461. Thus there is no need to look up
any tables. This test is intended to be a
refinement of Bartlett's well-known test
for homoscedasticity. The $W^2$ test has
higher power than the well-known Kolmogorov
statistics, which are based on the largest
absolute differences instead of the squared
terms in (3.13). Clearly, even further
refinements to the $W^2$ test are possible,
but may not make a practical difference.
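
The test in (3.12) and (3.13) can be sketched as follows. This is a minimal illustration, not the authors' program: the function names and the sample ratios are assumptions, and the Chi-square distribution function is computed with a standard series for the regularized incomplete gamma function so that only the Python standard library is needed. The large-sample 5% point of 0.461 is the value quoted in the text.

```python
import math

def chi2_cdf(x, df):
    """Chi-square distribution function via the series expansion of the
    regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total) and n < 10000:
        term *= z / (a + n)
        total += term
        n += 1
    value = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, value)

def cramer_von_mises(ratios, df):
    """W^2 = 1/(12T) + sum_t [Z_t - (2t-1)/(2T)]^2, where Z_t is the
    Chi-square distribution function at the t-th ordered ratio."""
    T = len(ratios)
    z = [chi2_cdf(w, df) for w in sorted(ratios)]
    return 1.0 / (12 * T) + sum(
        (z[t - 1] - (2 * t - 1) / (2.0 * T)) ** 2 for t in range(1, T + 1))

# Hypothetical ratios w_ASR,t with J = 2 replications, for illustration only.
example_ratios = [0.3, 1.1, 2.5, 0.9, 4.2, 1.8]
W2 = cramer_von_mises(example_ratios, df=2)
reject_homoscedasticity = W2 > 0.461   # large-sample 5% point from the text
```

A sample whose ordered values fall exactly on the Chi-square quantiles at (2t-1)/(2T) gives the smallest possible value, 1/(12T); larger values of W2 indicate a poorer fit to the null distribution.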

4. AN ILLUSTRATION

The theoretical developments discussed
above are illustrated with the help of an
example from Pindyck and Rubinfeld (1981,
p. 169). The results show a considerable
reduction in standard errors with similar
magnitudes of regression coefficients. The
details can be obtained from the authors
upon request.

5. CONCLUSION

We show that a minimum norm quadratic
estimator (MINQE) of heteroscedastic
variances can be derived by using
non-stochastic replications from a
fuzziness range discussed in Vinod
(1982a,b). Further, using this we provide
the efficient "estimated" GLS estimator of
the regression coefficients.

The proposed test is based on the fact
that the ratio of the ASR estimate of
heteroscedastic variance (based on our
Latin Square type replications of the X, y
data) to the usual estimate $s^2$ of
residual variance is approximately a
Chi-square variable with J (number of
replications) degrees of freedom. We use
the Cramer-von Mises $W^2$ test of
"goodness of fit" for the observed order
statistics of the ratios mentioned above.
The proposed estimation is the two-step GLS
procedure. This has been introduced in the
case of a completely unknown form of the
heteroscedasticity. We avoid the potential
specification error in choosing auxiliary
regressor or other variables related to
heteroscedasticity testing or estimation

1 Introduction

The  statistics  community  is  beginning  to  take  advan¬ 
tage  of  the  vast  computational  ability  that  is  now  sur¬ 
facing  in  the  computer  industry.  Many  new  algorithms 
and  techniques  now  exist  that  were  impractical  or  im¬ 
possible  to  execute  on  computers  of  just  a  few  years  ago. 
Moreover,  the  enormous  data  acquisition  and  storage 
capabilities  of  the  computer  have  allowed  statisticians 
to  confront  problems  that  are  increasingly  more  com¬ 
plex,  both  in  the  number  and  type  of  questions  asked 
and  in  the  size  and  structure  of  the  data  base  used  to 
answer  the  questions.  The  pencil-and-paper  methods 
that  were  characteristic  of  statistics  twenty  to  thirty 
years  ago  and  that  are  still  taught  in  many  of  today’s 
textbooks  are  inadequate  to  deal  with  such  problems 
because  these  methods  are  largely  concerned  with  sim¬ 
ple  models  and  small  data  sets.  Computerization  and 
extension  of  these  methods  have  allowed  more  complex 
models  and  larger  data  sets  to  be  analyzed,  but  the  line 
of  development  characterized  by  batch  computing  and 
the  use  of  statistical  packages  has  been  exhausted  both 
in  efficiency  and  capability. 

The purpose of this paper is to describe briefly the
statistical computing environment (hardware, operating
system, and application software capabilities) that a
statistician or an experienced data analyst needs in order
to deal efficiently with today’s problems, and to select
a currently available environment which we judge best
meets our criteria. Because of rapid developments in
hardware and software, any selection may very likely
be outdated (i.e. able to be improved upon) in three
or four years. Hence, it is important to choose extendable
software and upgradable hardware whose course of growth
has been, and will continue to be, on the state-of-the-art
development trajectory.

Some  broad  requirements  of  a  statistical  computing 
environment  are  given  in  Section  2.  Operating  systems 
and  statistical  software  are  discussed  in  Section  3.  Some 
techniques  that  are  currently  being  used  by  researchers 
in  statistical  computation  are  given  in  Section  4.  The 
new  equipment  needed  to  implement  these  techniques 
is  given  in  Section  5.  In  Section  6  we  give  our  specific 
choice  of  available  hardware  and  software  for  our  statis¬ 
tical  computing  environment.  Since  we  are  statisticians 
and  not  computer  scientists,  our  comments  are  focused 
on  the  functionality  of  the  computing  environment  and 
not  on  the  technical  aspects  of  the  software  and  hard¬ 
ware.  References  are  provided  to  the  technical  computer 
science  details. 


2  A  Statistical  Computing  Environment 

For  the  purpose  of  this  paper,  a  computing  environ¬ 
ment  consists  of  hardware,  an  operating  system,  and 
software;  the  distinction  between  the  last  two  is  not  al¬ 
ways  clear.  A  statistical  computing  environment  should 
aid  the  model  development  process  and  the  presentation 
of  intermediate  and  final  results.  The  model  develop¬ 
ment  process  includes:  model  specification,  estimation, 
and  criticism.  The  model  building  process  is  iterative, 
the  number  and  size  of  the  iterations  being  determined 
in  part  by  the  complexity  of  the  problem  and  the  data 
set.  It  requires  the  ability  to  cull  large  data  sets  to 
find  the  relevant  data,  to  display  complex  structure  in 
multidimensional  data,  to  interactively  direct  the  course 
of  the  analysis,  and  to  carry  out  computationally  inten¬ 
sive  methods.  The  computing  environment  should  allow 
quick  and  efficient  passage  through  the  model  building 
process,  especially  in  the  early  stages  when  the  problem 
is not well-defined. Capability, in the diminished sense
of just being able to perform a task (irrespective of the
amount of time it takes), is not adequate. Fast execution
(less than a second) of commands is essential for
full  productivity  of  the  analyst.  Quick  response  time 
is  needed  so  as  not  to  inhibit  problem  solving  activity. 
High  speed  is  also  essential  for  full  use  of  interactive 
graphics,  one  of  the  most  powerful  tools  in  exploratory 
data  analysis.  McDonald  and  Pedersen  (1985)  point  out 
that  to  draw  a  three-dimensional  scatterplot  contain¬ 
ing  1000  points  could  require  the  graphics  processor  to 
draw  at  a  speed  of  at  least  3  million  pixels  per  second. 
High  speed  color  graphics  requires  very  fast  processing 
and  large  storage  capabilities.  McDonald  and  Pedersen 
(1985)  give  guidelines  on  the  computational,  graphics 
processing,  and  graphics  display  speeds. 

The  development,  use,  and  maintenance  of  software 
are  central  issues  to  the  development  and  application  of 
statistics  and  data  analysis  techniques.  Solutions  to  sta¬ 
tistical  problems  often  require  that  standard  techniques 
be  put  together  in  slightly  new  ways  tailored  to  the  spe¬ 
cific  problem  of  interest.  A  good  statistical  computing 
environment  will  aid  this  activity. 

A  common  tool  for  analysis  is  a  statistical  package 
developed  for  a  batch  mode  environment  (in  contrast 
to  an  interactive  environment).  Typically  a  statistical 
package  provides  a  set  of  commands  to  carry  out  certain 
procedures.  The  commands  usually  produce  a  prede¬ 
termined  output  and  usually  cannot  be  combined  with 
other  commands  to  form  new  procedures.  Flexibility 
or  fine  tuning  of  the  command  is  achieved  through  a 
predetermined  list  of  options.  In  anticipation  of  various 
possible  outcomes,  a  typical  style  of  analysis  is  to  re¬ 
quest  “everything”  from  the  commands  in  a  statistical 
package,  in  order  to  submit  the  command  once.  This 


“shotgun”  approach  to  analysis  forces  the  user  to  look 
through  the  output  for  the  relevant  pieces.  For  the  in¬ 
experienced  analyst  or  the  infrequent  user  of  statistics, 
such  an  approach  may  be  beneficial  if  it  teaches  the 
user  about  a  new  technique  or  makes  him  aware  of  the 
limitations  of  the  analysis  and  the  data.  However,  the 
experienced  analyst  (for  whom  the  computing  environ¬ 
ment  described  here  is  intended)  generally  wants  more 
direct  and  immediate  control  over  the  analysis  and  does 
not  want  to  be  forced  into  a  particular  mode.  In  the 
analysis  of  very  large  data  sets  it  is  impossible  to  effec¬ 
tively  use  the  shotgun  approach.  Where  the  statistician 
may  prefer  to  interactively  direct  the  course  of  analy¬ 
sis  and  select  output  displays  of  interest,  such  capability 
may  be  of  limited  use  to  the  inexperienced  analyst.  The 
style  of  analysis  provided  by  a  statistical  package  may 
be  the  best  way  for  such  a  person  to  analyze  data,  but 
it  is  only  one  mode  the  statistician  may  choose. 

The  rather  fixed,  stand  alone  nature  of  the  com¬ 
mands  in  a  statistical  package  does  not  allow  the  easy 
creation  of  new  procedures.  One  is  usually  restricted 
only  to  what  the  package  can  do.  An  example  of  a 
statistical  language  which  does  not  have  many  of  the 
limitations  of  statistical  packages  is  S,  an  interactive 
environment  for  data  analysis  and  graphics.  See  Becker 
and  Chambers  (1985)  for  a  good  review  of  the  philos¬ 
ophy  and  capability  of  S.  S  is  both  a  very  high-level 
language  for  doing  computations  and  is  an  environment 
which  supports  data  management,  documentation,  and 
graphics.  S  graphics  does  not  currently  make  use  of  the 
latest  high  resolution,  high  speed  raster  graphics  display 
devices  available  on  workstations,  but  future  plans  are 
aimed  in  that  direction. 

Analysis  of  data  also  frequently  requires  that  new 
programs  be  created,  so-called  software  engineering.  Pro¬ 
grams  can  be  created  by  putting  together  existing  soft¬ 
ware  or  by  writing  entirely  new  software.  A  statistical 
computing  environment  should  recognize  this  need  and 
provide  software  development  and  debugging  tools  to 
efficiently  produce  new  software.  Since  a  resulting  pro¬ 
gram  is  often  specific  to  a  problem  and  may  not  be  of 
general  use,  it  is  very  important  that  the  development 
be  done  quickly.  If  it  takes  too  much  time,  it  will  not 
be  done,  possibly  lowering  the  quality  of  the  analysis. 
Ideally,  the  new  software  should  be  developed  in  the 
context  of  a  statistical  language,  such  as  S,  so  that  al¬ 
ready  existing  input/output  and  graphics  routines  can 
be  easily  used  and  development  time  can  be  reduced. 

Graphical  displays  are  an  absolute  necessity  for  data 
analysis.  As  noted  by  Chambers  et  al.  (1985),  “there 
is  no  single  statistical  tool  that  is  as  powerful  as  a  well- 
chosen  graph.”  The  human  mind  is  far  superior  to  any 
computer system in its ability to detect patterns. The
use of color, dynamic displays, and other enhancements
to a graph has great potential in aiding the analysis of
the  data.  Creating  graphics  to  exploit  the  great  pattern 
discovery  capabilities  of  the  human  visual  system  is  an 
area  of  current  research. 

Documentation  of  results  and  the  ability  to  produce 
manuscripts  of  high  typographic  quality  in  a  timely  fash¬ 
ion  are  essential  to  the  success  of  a  statistical  project. 


High typographic quality is required since statistical
papers typically consist of mathematical equations with
Greek letters, special symbols, subscripts, superscripts,
arrays, tables, charts, and the like. The production of
such papers has typically been done by secretaries. As
capable as they may be, the system of first writing the
paper by hand and then giving it to a secretary to type
is inefficient. Often, especially in mathematical writing,
new errors are introduced into the paper in transcription.
Much time may be consumed in explaining what is meant
and how it should be typed. The researcher may be
reluctant to revise because of the increased burden on
the secretary, who must typically serve many people.
A statistical computing environment must include
interactive tools that enable the analyst to document
results and produce high quality papers.

Coincident  with  the  documentation  of  results  is  the 
presentation  of  results.  Written  material  and  graphi¬ 
cal  displays  may  be  presented  on  paper,  transparencies 
(foils),  slides,  movie  film,  or  some  other  media.  Access 
to  appropriate  output  devices  is  necessary. 

Since many statistical projects are interdisciplinary,
e.g. involving engineers, statisticians, and other
scientists, data and results must be easily communicated.
This can easily be done by means of a computer network.
Data can be transferred via tape, but the time, bother,
and inaccessibility of tape inhibit its use. Networks
encourage frequent communication of rather small
messages and allow for the rapid dissemination of
information. Networks are also important in the
configuration of hardware. This will be discussed in
Section 5.

A  good  statistical  computing  environment  must  al¬ 
low  the  statistician  to  interactively  formulate  models, 
to  efficiently  and  quickly  analyze  data,  to  document 
and  display  results,  and  to  communicate  these  results  at 
all  stages  of  the  analysis.  Such  an  environment  would 
provide  fast  execution  time  and  fast  program  develop¬ 
ment  time.  As  has  been  noted,  a  fast  machine  is  not 
enough.  The  hardware  and  software  must  be  consid¬ 
ered  together. 

3  Operating  Systems  and  Software 

A  statistical  research  group  often  develops  software 
to  develop  and  implement  new  techniques.  A  statisti¬ 
cal  consulting  group  often  needs  to  modify  or  combine 
standard  programs  to  solve  a  particular  problem.  Both 
groups  need  an  operating  system  with  tools  that  facil¬ 
itate  such  activities.  Since  the  operating  system  runs 
programs,  manages  the  computer’s  resources,  and  pro¬ 
vides  an  interface  between  the  computer  and  the  user, 
the  choice  of  an  operating  system  is  crucial  to  the  effi¬ 
ciency  and  productivity  of  the  statistician.  An  excellent 
and,  in  our  opinion,  the  best  available  operating  sys¬ 
tem  for  statistical  work  is  the  multi-user,  multi-tasking 
UNIX  operating  system.  UNIX  was  developed  by  a 
group of programmers at Bell Labs for their own use. In
contrast  to  operating  systems  developed  by  computer 
vendors,  UNIX  had  a  rather  long  gestation  period  in 
academic and research-oriented environments before it
became commercial. The long period of development
in  a  protected  environment  promoted  the  development 
of  new  ideas  and  the  discarding  of  bad  ones.  The  re¬ 
sult  was  a  very  flexible  environment  rich  in  utilities  and 
tools.  For  example,  the  UNIX  system  comes  with  text 
editors,  graphics  routines,  pattern  scanners,  languages 
such  as  C,  FORTRAN,  and  PASCAL,  and  document 
preparation  utilities,  to  name  a  few.  The  modular  de¬ 
sign  of  UNIX  and  the  fact  that  it  is  written  in  the  C 
language,  allow  it  to  be  customized  to  meet  the  user’s 
needs.  The  statistical  language  S  was  developed  in  a 
UNIX  environment. 

At  present,  there  are  two  versions  of  UNIX:  UNIX 
4.2BSD  with  enhancements  and  improvements  from  Uni¬ 
versity  of  California,  Berkeley,  and  SYSTEM  V,  the  ver¬ 
sion  provided  by  AT&T’s  Bell  Labs.  The  Berkeley  ver¬ 
sion  is  suited  to  researchers,  while  the  Bell  Labs'  version 
is  aimed  at  the  commercial  market.  Work  is  presently 
underway  to  merge  these  two  versions  into  one  version 
that  will  be  called  SYSTEM  V.  Presently,  the  Berkeley 
version  is  preferred  for  a  statistical  research  computing 
environment. 

Aside from the operating system, other software
mainstays of a statistical computing environment are: S,
APL, IMSL, a graphics package, the linear algebra
subroutines in Linpack and Eispack, and a document
preparation utility such as troff in UNIX or TeX. This
software can be easily installed on a UNIX system.

4  Current  Techniques 

Some of the most well-known applications of graphics
workstations have been provided by the Prim-9, Prim-11,
and Orion I projects. These projects and the
computationally intensive Interactive Projection Pursuit
Regression project (McDonald, 1982) feature graphics with
real time motion. The latter project is a sophisticated
example  of  interactive  model  fitting.  These  techniques 
make  extensive  use  of  the  workstation’s  fast  numeri¬ 
cal  and  graphics  processing  capabilities.  The  success  of 
the  technique  is  dependent  on  interaction  with  the  user. 
Hence,  the  feasibility  and  the  success  of  these  techniques 
are  dependent  on  the  computing  environment. 

The  statistical  computing  environment  is  important 
not  only  for  the  above  applications,  but  also  for  com¬ 
monly  used  techniques  such  as  regression.  Much  is  known 
about the pitfalls and limitations of regression. For
example, in the analysis of linear models good statistical
practice dictates that one examine residuals, search
for outliers and influential observations, detect exact
and near multicollinearities, and, in general, subject the
model to severe criticism. Since these techniques cannot
be totally automated, the computing environment
must  assist  in  carrying  out  these  tasks  with  a  high  de¬ 
gree  of  interaction.  Thus,  in  order  to  fully  apply  known 
statistical  techniques  to  even  conventional  procedures, 
a  powerful,  highly  interactive  system  is  needed. 

A  multivariate  plotting  routine  at  General  Motors 
Research  Laboratories  (GMR)  provides  an  example  of 
the  increase  of  power  and  usefulness  that  interactive 
graphics  capability  can  add  to  a  static  graphics  display. 


The  intended  residence  of  such  interactive  graphics  soft¬ 
ware  is  a  professional  workstation.  The  multivariate 
plotting  routine  had  its  origins  in  a  procedure  which 
plots  multivariate  data  as  a  matrix  of  pairwise  scatter- 
plots.  The  use  of  color,  smoothing  of  data,  choice  of 
plotting  symbols  and  their  sizes,  and  other  graphical 
enhancements make the graph very useful for detecting
patterns. A similar capability exists with the pairs
command in S. Interactive graphics capability allows
the user to modify, highlight, or select parts of the
display immediately, as ideas occur to him. In a static
situation, one must do all of the thinking before
reissuing the command to produce the new graph.

5  Hardware 

The  statistical  computing  environment  described  in 
Section  2  is  quite  specialized  to  the  needs  of  the  pro¬ 
fessional  statistician  and  the  experienced  data  analyst 
and  so  may  not  meet  the  needs  of  the  general  comput¬ 
ing  community.  The  computing  tasks  of  the  general 
community  are  usually  done  on  a  mainframe  computer 
with  a  time-sharing  operating  system.  A  mainframe 
must  have  the  capacity  and  auxiliary  devices  to  meet 
the  needs  of  its  user  community,  and  the  time-sharing 
environment  must  allocate  these  resources  in  an  effi¬ 
cient  and  equitable  way.  Compromises  are  inevitable. 
The  environment  described  in  Section  2  will  sometimes 
require  that  all  the  available  resources  be  uncompro¬ 
misingly  given  to  one  person.  Since  by  design  a  time¬ 
sharing  environment  does  not  let  that  happen,  it  is  not 
the  appropriate  environment  for  the  statistical  tasks  de¬ 
scribed  here.  What  one  needs  is  one’s  own  computer. 
Fortunately,  the  cost  and  architecture  of  today’s  super 
microcomputing  systems,  or  graphics  workstations,  pro¬ 
vide  just  that.  The  philosophy  is  not  to  build  bigger 
and  bigger  computers  to  meet  the  ever  expanding  spe¬ 
cialized  needs  of  different  user  groups,  but  to  build  spe¬ 
cialized  machines  and  link  them  via  a  network.  The 
network  may  contain  mainframe  computers,  multi-user 
computers,  personal  computers,  midicomputers,  super 
computers,  printers,  tape  drives,  fileservers,  and  gate¬ 
ways  to  other  networks.  McDonald  and  Pedersen  (1985) 
give  the  hardware  requirements  of  a  statistical  graphics 
workstation.  Joy  and  Gage  (1985)  give  an  overview  of 
the  impact  of  the  new  hardware  on  scientific  computing. 

The  hardware  requirements  needed  to  support  the 
statistical  computing  environment  being  described  rule 
out  the  personal  computer  as  a  possible  computing  de¬ 
vice.  The  personal  computer  is  characterized  by  slow 
processor  speed,  limited  addressable  memory,  and  prim¬ 
itive  (proprietary)  operating  systems.  However,  these 
characteristics  are  changing,  so  that  the  capabilities  of 
the personal computer are evolving at different rates
towards those of the workstation. The current graphics
of the personal computer have too few colors and too
coarse a resolution. Much of the software written for
mainframes will not run on personal computers.
Workstations do not have these hardware and software
limitations. Since the workstation has only one user, the
human interface can be designed to increase the
productivity of the user and help him concentrate on his
problem rather than on computer science.

That  one  person  at  a  time  uses  a  workstation  does 
not  imply  that  the  user  is  isolated.  As  noted  previously, 
workstations  can  be  configured  in  a  network.  Since  a 
network  allows  the  fusion  of  different  machines,  it  is  im¬ 
portant  that  the  network  be  designed  following  industry 
standards  so  that  the  greatest  number  and  variety  of  the 
available  computer  hardware  are  compatible  with  it.  If 
a  vendor  proprietary  networking  system  is  chosen,  then 
one  may  be  locked  into  that  vendor’s  hardware.  An 
open architecture philosophy allows one to more readily
acquire the latest hardware and, likewise, to more
easily sell it. It also encourages competition among the
Silicon Valley upstarts, thereby improving quality and
reducing prices.

6 Vendors for Statistical Computing Environment
Hardware and Our Final Selection

The  criteria  for  the  choice  of  a  microcomputing  net¬ 
work  for  statistical  computing  may  be  summed  up  as 
follows: 

•  high  resolution,  high  speed  graphics; 

•  UNIX  operating  system; 

•  a  rich  environment  for  program  development  and 
maintenance; 

•  large random access memory and disk storage to
handle very large data sets (one data set which we
will analyze consists of a million observations with
about 15 variables);

•  networking  capability; 

•  multi-color  graphics  (at  least  enough  for  shading) . 
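
The memory criterion in the list above can be made concrete with a short calculation. The sketch below is illustrative only: the storage sizes of 4 and 8 bytes per value are our assumptions (the paper does not state how the data are stored), and a megabyte is taken as one million bytes.

```python
def dataset_megabytes(n_obs, n_vars, bytes_per_value):
    """Storage needed to hold a rectangular data set, in units of 10^6 bytes."""
    return n_obs * n_vars * bytes_per_value / 1_000_000

n_obs, n_vars = 1_000_000, 15   # figures quoted in the criteria list

single = dataset_megabytes(n_obs, n_vars, 4)   # assumed 4-byte values
double = dataset_megabytes(n_obs, n_vars, 8)   # assumed 8-byte values

print(f"4-byte values: {single:.0f} MB; 8-byte values: {double:.0f} MB")
```

Under either assumption the full data set is far larger than a main memory of a few megabytes, which is consistent with the criterion asking for large disk storage as well as large random access memory.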

The  vendors  that  we  personally  contacted  were  DEC 
with  microVAX  II  and  VAXSTATION  520,  APOLLO 
with  560  and  660,  and  SUN  with  SUN-3.  Other  candi¬ 
date  vendors,  such  as  Chromatics,  Iris,  Ridge  32,  and 
Symbolics  3600,  were  not  investigated  due  to  local  un¬ 
availability  and  the  limited  time  of  the  search  team.  The 
final  decision  was  to  choose  SUN-3. 

The  DEC  offerings  were  rejected  because  of  the  slower 
processing  speed  and  the  diminished  graphics  capability 
relative  to  SUN.  DEC’s  graphics  workstation  is  really 
a  Tektronix  color  monitor  attached  to  a  monochrome 
VAXstation. The graphics workstation does not support
a diskless node and it does not support ULTRIX-32m,
DEC’s version of UNIX 4.2BSD. Remote login
from  a  microVAX  with  ULTRIX  to  a  machine  with  a 
VMS  operating  system  (DEC’s  proprietary  operating 
system)  was  not  possible  at  the  time  of  our  investiga¬ 
tion.  Networking  would  be  accomplished  with  DEC’s 
proprietary  networking  system  DECNET. 

APOLLO  was  a  closer  contender  to  SUN.  At  the 
time  of  our  visit  to  APOLLO,  the  company  did  not 
have  the  Motorola  68020  chip  (MC68020)  in  its  high 
performance color graphics workstation. APOLLO was
rejected because their UNIX operating system was not
completely independent of their proprietary operating
system AEGIS. As described in their manual (APOLLO,
1985), one “should be familiar with both UNIX and
AEGIS software, as well as DOMAIN networks” in order
to use the system. The top of the line color graphics
workstation,  the  DN660,  had  a  32-bit  bit-slice  proces¬ 
sor  and  could  support  a  maximum  of  8  Mb  of  Ran¬ 
dom  Access  Memory  (RAM).  At  the  time  of  our  visit, 
this  machine,  though  still  being  sold,  was  in  the  pro¬ 
cess  of  being  phased  out  in  favor  of  a  machine  built 
on  the  68020  chip.  The  DN460,  the  high  performance 
monochrome  workstation,  could  support  only  3  Mb  of 
main  memory.  Warranty  analysis  at  GMR  requires  at 
least  15  Mb  of  RAM.  A  paint  attribute  surface  represen¬ 
tation  program  at  GMR  can  easily  consume  16  Mb  of 
RAM.  The  APOLLO  equipment  does  not  match  SUN’s 
speed  and  memory  capacity.  A  further  disadvantage  of 
the  APOLLO  equipment  is  that  it  is  designed  to  run  on 
its  own  proprietary  network. 

The  SUN  equipment  most  fully  met  our  criteria.  The 
SUN  workstations  are  based  on  the  68020  chip  and  can 
support  up  to  16  Mb  of  main  memory.  The  SUN  op¬ 
erating  system  is  an  extension  of  UNIX  4.2BSD  (the 
extension  allowing  for  such  innovations  as  windowing). 
The  design  philosophy  of  the  equipment  is  to  closely  fol¬ 
low  industry  standards  (or  working  standards)  as  much 
as  possible.  Hence,  SUN  uses  the  ETHERNET  as  its 
network,  in  contrast  to  APOLLO  who  developed  their 
own  DOMAIN  network.  In  comparison  with  APOLLO’s 
products,  the  SUN  workstations  are  faster,  have  more 
main  memory,  conform  more  closely  to  industry  stan¬ 
dards,  and  are  totally  based  on  UNIX. 

Future developments in SUN workstations include a
HYPERchannel connection in May 1986. SUN already
has a connection to the Cray, the so-called
“Craystation.”

The current graphics software on the SUN consists of
SunCore and SunCGI. SunCGI complies with the ANSI and
ISO draft of the Computer Graphics Interface for fast
two-dimensional graphics. SunCore follows the SIGGRAPH
de facto standard for two- or three-dimensional graphics.
SunGKS is available for more advanced graphical
capabilities.

Some pertinent technical facts about the SUN stations
are given by Sun (1985). The high resolution
screens on the SUN are 1152 (h) by 900 (v) pixels. The
refresh rate is 66 Hz, non-interlaced. The color palette
has eight planes with 256 simultaneously displayed colors.
The SUN requires no special temperature (0°C to 40°C),
humidity (5-95%), or altitude (0-3,000 m) environment.
The Motorola 68020 chip, running at 16.67 MHz, is included
in each station along with the 68881 coprocessor, which
runs at 12.5 MHz. Spanier (1985) provides some timing
comparisons between DEC machines with either a
“UNIX or a UNIX-like operating system” and the SUN-3.
The SUN-3 without a floating point processor is 1.6
times as fast as a VAX 780 in doing floating point
computations. It is about 1.8 times as fast in integer
computations. These figures are just slightly better when
compared to a microVAX II. The SUN-3 with a floating
point processor is 4.6 times as fast as a VAX 780 in
doing floating point computations. This figure is just
slightly better when compared to a microVAX II. Due to
the unavailability of the new APOLLO MC68020-based
system, a comparison is not available at this time.

Figure 1 contains a listing of the system hardware
that will meet our research needs. The system is
configured for six users. Figure 2 is a schematic
representation of the subnetwork with the system
configuration of the hardware described in Figure 1.

QUANTITY  DEVICE             HARDWARE
1         3/160C             68020, 68881, FPA, GB, GP, 16MEG, Color, High Res.
2         3/160C             68020, 68881, 4MEG, Color, High Res.
1         3/160M             68020, 68881, 4MEG, High Res., File Server
2         3/50               68020, 68881, 4MEG, Monochrome, High Res.
1         Ethernet           Cable and Terminators
1         Communication Box  Connection to the GMR Ethernet
1         Tape Drive         1600 BPI Tape Drive
2         2 Eagle Disks      Mass Storage 760 MBytes
1         Laser Writer       Monochrome High Res. Hard Copy
1         Benson Printer     High Res. Color Ink Jet

The above selection of hardware and the final system
configuration are intended to be used for statistical
computing. Though this system may be appropriate for
other groups, it should not be viewed as prototypical.
A big advantage of a subnetwork of workstations is the
ability to create a specific computing environment
customized to the users’ needs. Hence, one should
carefully assess one’s needs when designing and acquiring
a computing environment. In particular, the following
(somewhat obvious) steps should be taken:

1.  State the computing requirements (e.g. number of
users,  type  of  network,  graphics  capabilities,  out¬ 
put  devices,  floating  point  speed,  graphics  process¬ 
ing  speed,  level  of  hardware  and  software  mainte¬ 
nance  and  support,  and  operating  system  charac¬ 
teristics).  Scott  (1985)  gives  a  checklist  of  vari¬ 
ables  to  consider  in  general,  but  especially  when 
the  computing  environment  is  to  be  used  for  code 
development  and  production  work  in  scientific  com¬ 
putation. 

2.  Identify  candidate  vendors  for  the  equipment  and 
software. 

3.  Compare  vendor  capabilities  and  screen  out  ven¬ 
dors  not  meeting  the  requirements. 

4.  Identify  possible  system  configurations  among  the 
contending  vendors. 

5.  Select  the  system  which  will  best  satisfy  the  re¬ 
quirements.  Scott  (1985)  suggests  that  the  prob¬ 
lem  may  be  formulated  as  a  linear  program. 
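
Steps 3 and 5 can be illustrated with a small sketch. It is only a sketch under stated assumptions: the candidate configurations, their costs, speeds, and the scoring weights are all invented, and plain enumeration is used here in place of the linear program that Scott (1985) suggests, which is reasonable only because the candidate list is tiny.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float         # hypothetical cost, arbitrary units
    users: int          # number of users supported
    float_speed: float  # hypothetical relative floating point speed
    color: bool         # color graphics available


# All figures below are invented for illustration only.
CANDIDATES = [
    Candidate("A", cost=100.0, users=6, float_speed=4.0, color=True),
    Candidate("B", cost=80.0, users=4, float_speed=5.0, color=True),
    Candidate("C", cost=150.0, users=8, float_speed=6.0, color=True),
    Candidate("D", cost=90.0, users=6, float_speed=3.0, color=False),
]


def screen(candidates, min_users, need_color, budget):
    """Step 3: drop candidates that miss a hard requirement."""
    return [
        c for c in candidates
        if c.users >= min_users
        and (c.color or not need_color)
        and c.cost <= budget
    ]


def select(candidates, weights):
    """Step 5: pick the remaining candidate with the best weighted score."""
    def score(c):
        return weights["float_speed"] * c.float_speed + weights["users"] * c.users
    return max(candidates, key=score) if candidates else None


feasible = screen(CANDIDATES, min_users=6, need_color=True, budget=120.0)
best = select(feasible, weights={"float_speed": 1.0, "users": 0.1})
```

With these invented numbers only configuration A survives the screening, so it is also the one selected. With many configurable components the enumeration would no longer be practical, which is where a linear (or integer) programming formulation becomes attractive.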

Taking these steps is important in order to make full use of the new
technology provided by a distributed computing network, where each
node (here, a subnetwork of workstations) on the backbone network
should offer an environment customized to enhance the efficiency,
productivity, and capabilities of the particular user group. Moreover,
the selection of a workstation should help one to go in new research
directions and not merely to mimic or expedite current capabilities.
Knut M. Wittkowski, Univers
ABSTRACT 

Until now, most approaches for building expert systems with
applications in statistics have concentrated on the area of generating
hypotheses (RX, GUHA-80, REX, STUDENT, GLIM-Front-End). These expert
systems make decisions on the basis of the empirical distribution of
the data, subjective opinion, and some a-priori knowledge. The present
paper proves some concepts underlying these systems to be
inappropriate for testing statistical hypotheses.

Based on a new rating and a new classification of knowledge on
statistical concepts, problems and methods and on rules for checking
the appropriateness of sub-problems, selection of statistical methods
is formalized as a special pattern recognition process. It is
demonstrated how an expert system can support the user in choosing
methods and interpreting results.

KEY WORDS: artificial intelligence, experimental design, generalized
linear model, nonparametric statistics, multiple comparisons,
confirmatory data analysis

1. INTRODUCTION

Statistical methods are frequently used

 - to identify criteria that allow for discrimination between groups
   or to predict the outcome of some event,
 - to generate hypotheses that provide an explanation of some
   biological, sociological or economic process,
 - to test some of these hypotheses on an observed set of data.

Each method is (implicitly) based on a mathematical model, so that on
the one hand a decision on a model can be made by observing the
results of different methods and on the other hand a method can be
selected according to a predefined model.

If a model is built for the purpose of prediction or discrimination,
there is no need for its parameters to represent concepts that have an
interpretation in reality. The major goal is to provide a "black box"
that gives a good prediction or few misclassifications, respectively.
In medicine, for instance, a model containing some unrealistic
parameters might lead to correct diagnoses and a treatment might be
useful, even if the underlying mechanism is not known (until recently,
no one knew how aspirin relieves pain).

In generating hypotheses it is not sufficient to find a model that
allows prediction of an outcome, but it is also necessary that this
model can be interpreted, e.g. that it shows which parameters
influence this outcome.

In testing hypotheses, the requirements are even more restrictive,
because the purpose is not only to look for a model that provides an
explanation, but to compute the probability of erroneously choosing
"significant" parameters for this model.

This paper discusses implications of these special demands in the
field of testing hypotheses on expert systems. Section 2 discusses
some concepts used in expert system approaches for generating
hypotheses. It is demonstrated why most of these concepts are not
applicable for testing hypotheses. A new concept is introduced that is
based neither on data nor on assumptions or knowledge but on the
interest in the analysis. In Section 3 a classification of relevant
criteria is introduced and the area of applicability is defined.
Section 4 gives a solution for the problem of multiple analyses on
subsets of the same set of data and a representation of the process of
selecting statistical methods as a special pattern recognition
process. An example is given in Section 5 and in Section 6 some
consequences of these new concepts are outlined.

2.  CRITERIA  FOR  MODEL  SELECTION 

2.1.  Generalized  linear  models 

The process of building a model starts with formalizing the
information available prior to observing the data. This a-priori
knowledge and the data remain unchanged in the following process of
fitting different models to the data. This process will be discussed
in the context of generalized linear models:

$$L(Y_{ijk}) = f_0 + f_1(a_i) + f_2(b_j) + \dots + f_n(ab_{ij}) + \dots + e_{ijk}$$

where observations $Y_{ijk}$ are decomposed into main effects $a_i$,
$b_j$, ..., interactions $ab_{ij}$, $ac_{ik}$, $bc_{jk}$, ..., and an
error term $e_{ijk}$ representing residuals that cannot be explained
by main effects or interactions.

After all possible terms for the model equation have been identified,
a set of terms (main effects and interactions), a set of functions
$f_t$, and a "link" function L are selected to find a model that fits
"best" (e.g. in terms of least squares). For prediction and
discrimination, criteria for selection are typically based on the
distribution of the data. For hypotheses generation some a-priori
knowledge or assumptions on the area of application may also be used
to mark all terms that are known to be necessary or undesired for
interpretation. Examples of expert systems for these purposes are RX
(BLUM 1978), REX / STUDENT (GALE and PREGIBON 1984), GUHA 80 (HAJEK
and IVANEK 1982), and the front-end for GLIM (NELDER and WESTENHOLM
1986). In the following, these systems' sources for information (data,
assumptions, knowledge) will be discussed with respect to
applicability in the field of testing hypotheses and a new concept
will be introduced. To simplify terminology, prediction and
discrimination will be treated as special cases of hypothesis
generation.


2.2. Rules based on data

Consider the case where the influence of a treatment is to be tested
in a (generalized) linear model with several factors and the
experimenter is not sure which terms are to be included into the model
equation. Suppose the expert system decides on the basis of rules like

   IF    inclusion of term $f_t(x_{ijk})$ leads to a higher F-ratio
         for the c-th factor,
   THEN  include this term into the model equation.

Although rules might often look different, they may have a similar
effect on the decision. The following rule, for instance, is taken
from REX (GALE and PREGIBON 1984):

   RULE-1

   IF    the distribution of y is unduly skew
   AND   the sign of y is positive
   THEN  assert that logarithms of the response variable y should be
         used.

In generating hypotheses these rules might be very useful, because in
that context minimizing variability of the residuals is most important
and RULE-1 might be a valuable suggestion. Whether or not the
variability was actually reduced can be checked on the data. In
testing hypotheses, however, the following argument proves rules based
on the empirical distribution of the data to be inapplicable: The more
rules are available and the more functions $h(y_{ijk})$ and
$f_t(x_{ijk})$ are considered, the greater is the probability of
finding a model that leads to a statistic with a p-value less than a
given alpha.

If the p-value is of interest, i.e. if the decision was intended to be
made with a limited probability of an error of first kind, the model
that was determined to be the "best" fit must under no circumstances
be used for testing the hypothesis of no treatment effects!

If an expert system is to be used for testing hypotheses, it must not
derive its information from the data. (There are only few exceptions,
like adaptive rank tests, where looking at the data does not affect
the conservatism of the test procedure. These exceptions will not be
considered here.) This argument rules out application of the above
mentioned expert systems for the purpose of testing statistical
hypotheses.

2.3. Rules based on assumptions

Many "rules" in common textbooks are based on assumptions on the
distribution of the random variables:

   RULE-2

   IF    the distribution of Y is log-normal,
   THEN  logarithms of the response variable y should be used.

There is no doubt that RULE-2 is true: If two log-normally distributed
groups are to be compared with respect to the location of the random
variable, taking logarithms provides for a higher efficiency, i.e. a
smaller number of observations is necessary to achieve a significant
result for groups differing in location.

However, how should one know that the assumption is true? Assumptions
on distributions are not very realistic for many applications:
Log-normal distribution of residuals, for instance, can only be
guaranteed if all unknown sources of variation are both multiplicative
and independent! The (asymptotic) relative efficiency (ARE) of two
methods, however, depends heavily on the distribution of the residuals
in the model chosen: The t-test, for instance, is more efficient than
the U-test only for some distributions (e.g. Gaussian); for other
distributions (e.g. logistic) the converse might be true.

In some applications the data is used to "prove" that an assumption is
met. This approach is based on a common misunderstanding of the
concept of "significance": A non-significant test for deviation from
normality does not mean that the distribution is proven to be normal.
Even if alpha is chosen to be as high as .20, .50, or .80, it is the
unknown error of second kind (beta) that counts. The effect of an
error of second kind in "proving" the appropriateness of some
assumption on the error of first kind in testing a hypothesis is not
predictable.
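
The role of beta can be made concrete with a small simulation. The following is a sketch under stated assumptions, not a procedure from the paper: the data are drawn from a log-normal distribution (so normality is false by construction), and a crude asymptotic skewness test, with sqrt(n/6) times the sample skewness treated as standard normal, stands in for whatever normality test an analyst might use. The function names and the sample sizes are illustrative choices.

```python
import math
import random


def sample_skewness(xs):
    """Moment-based sample skewness m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def normality_p_value(xs):
    """Two-sided p-value of a crude asymptotic skewness test for normality."""
    z = sample_skewness(xs) * math.sqrt(len(xs) / 6.0)
    return math.erfc(abs(z) / math.sqrt(2.0))


def miss_rate(n, alpha, reps, seed=0):
    """Fraction of log-normal samples for which normality is NOT rejected."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        xs = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        if normality_p_value(xs) >= alpha:
            misses += 1
    return misses / reps


# Estimated error of second kind (beta) with a generous alpha of .20
beta_small = miss_rate(n=10, alpha=0.20, reps=1000)
beta_large = miss_rate(n=200, alpha=0.20, reps=1000)
```

For small samples a noticeable share of clearly non-normal data sets passes the test, so a non-significant result says little about whether the assumption holds. The share shrinks as the sample grows, but in practice beta remains unknown because the true distribution is unknown.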

Again some information currently used in hypothesis generation has to
be re-evaluated for hypothesis testing: Assumptions on the
distribution of residuals are helpful only on rare occasions, where
all sources of variation are known and their independence is actually
proven.

2.4. Rules based on knowledge

Where information from the data itself cannot be used and assumptions
are not realistic, knowledge that is independent of the observed data
has to be considered as a basis for selection of statistical methods.
Because so far most expert systems have been developed for hypothesis
generation, knowledge has been given little attention. If a-priori
information is considered at all, knowledge bases typically contain
rules as in the following example from the STATPATH knowledge base
(PORTER and LAI, 1983):

   RULE-3

   IF    the scale is ordinal
   THEN  use a rank test

It is well known that a rank test is more efficient for detecting
differences in location than a chi-square test. However, if the
purpose of the analysis is to detect any difference in distribution
(location, skewness, number of modes, scale, etc.) a rank test might
be extremely insensitive.

Consider the following example, where the status after a treatment
(placebo or verum) was measured in terms of "better" (+), "unchanged"
(0), and "worse" (-):

A rank test (U-test corrected for ties) on the difference between
placebo and verum leads to a test statistic of U = 0, i.e. the test
would not be sensitive for the observed type of difference in effects.
A rank test is only appropriate if the "tendency" in effect is of
interest (i.e. whether the probability of a preferable result is
different for the treatments). On the other hand, the chi-square test
is appropriate if any difference in the effects is of interest.

The scale level, however, must not be ignored, either. Even if the
data in the example above had been numerically coded (0:="-", 1:="0",
2:="+"), a t-test would typically give a meaningless result for
ordinal or nominal variables. It follows that the scale level
restricts the set of possible methods but does not determine the
method (except for nominal variables, where means and standard
deviations are meaningless, even if common analysis systems like BMDP,
P-STAT, SAS, and SPSS typically provide these measures as defaults for
"descriptive" analysis).

2.5. Rules based on interest

As demonstrated in the examples above, for testing hypotheses the
rating of efficiency and consistency as criteria for selection of
models (or methods) has to be re-evaluated. Efficiency was defined as
a measure of the number of observations necessary to achieve a
significant result under a given alternative. Consistency of a method
against a certain type of alternative means that any alternative of
this type will lead to a significant result for (almost) any
underlying distribution, provided the sample size is big enough.
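
The difference between the two criteria can be seen in the placebo/verum comparison of Section 2.4. The sketch below uses hypothetical counts (they are not the paper's data) chosen so that the two treatments differ strongly in distribution but not in tendency. It computes the Mann-Whitney U statistic with ties counted as one half, and the Pearson chi-square statistic, whose p-value for 2 degrees of freedom is exactly exp(-x/2).

```python
import math

# Hypothetical counts per outcome, ordered as: worse, unchanged, better.
placebo = [5, 10, 5]
verum = [10, 0, 10]


def mann_whitney_u(a, b):
    """U statistic of sample a versus b from ordered category counts.

    Counts the pairs in which the a-observation is better, plus one half
    of the tied pairs.
    """
    u = 0.0
    for i, count_a in enumerate(a):
        for j, count_b in enumerate(b):
            if i > j:
                u += count_a * count_b
            elif i == j:
                u += 0.5 * count_a * count_b
    return u


def chi_square(a, b):
    """Pearson chi-square statistic for a 2 x k table of counts."""
    n_a, n_b = sum(a), sum(b)
    total = n_a + n_b
    stat = 0.0
    for count_a, count_b in zip(a, b):
        column = count_a + count_b
        for observed, n_row in ((count_a, n_a), (count_b, n_b)):
            expected = n_row * column / total
            stat += (observed - expected) ** 2 / expected
    return stat


u = mann_whitney_u(placebo, verum)
u_null = sum(placebo) * sum(verum) / 2   # expected U without any tendency
chi2 = chi_square(placebo, verum)
p_chi2 = math.exp(-chi2 / 2)             # exact for 2 degrees of freedom
```

With these counts U equals its null expectation, so the standardized rank statistic is zero no matter how large the sample becomes, while the chi-square test rejects clearly. The rank test is consistent against differences in tendency only; the chi-square test is consistent against any difference in distribution.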

Until now, efficiency was (implicitly) considered the more important
criterion for selecting statistical methods. This view resulted from
the wealth of results concerning efficiency of test statistics in the
field of mathematical statistics, where the set of distributions is
typically restricted (e.g. to Gaussian distributions) to ensure that
all methods are consistent against the same alternatives.

Assumptions on distributions, however, have been demonstrated to be
neither realistic nor provable on the basis of the observations. On
the other hand, the assumption of normality often taken as a reason
for transforming data is relatively unimportant even to analysis of
variance procedures (including the well-known t-test) as far as
consistency is concerned. As a consequence, the asymptotic relative
efficiency, though important in the field of theoretical statistics,
is of little value in the field of applied statistics.

Comparing these concepts, it seems more reasonable to base a decision
on consistency rather than on efficiency. Rules based on efficiency
might result in the "best" solution for the wrong problem, i.e. a
completely misleading decision, while decisions based on consistency
will give results for the problem of interest, even if this solution
is "not optimal", i.e. given with a p-value that might not be exact.

In terms of consistency, there is a second reason why RULE-1 is not
applicable for testing hypotheses: It is a well-known fact that (1)
log-normally distributed data are skewed to the right and positive and
(2) the geometric mean is the most efficient estimator of the median
for this type of data. This implication, however, cannot be inverted:
A distribution that is skewed to the right and positive need not be
log-normal. Because geometric mean and median estimate different
parameters for most other distributions, they are not comparable in
terms of efficiency. Taking logarithms and computing means may lead to
completely different results than computing medians, because it leads
not only to a transformation of the residuals but also to a
transformation of the problem.

The empirical distribution in all four groups of the following example
is skewed to the right and all observations are positive. Means and
standard deviations are even positively correlated, as they should be
in log-normally distributed data. The example proves that differences
in logarithms are sensitive to differences both in location and in
scale of the original data (B vs. C). Differences in scale might
neutralize or even reverse differences in location (A vs. B and C vs.
D, respectively). Moreover, variances are not "stabilized" by
computing logarithms of the data (y = log(x)).
An expert system for testing hypotheses needs rules that depend on the
interest (type of influence), i.e. whether a reduction by 50% is as
relevant as an increase by 100% (RATIO: compute logarithms) or by 50%
(EXPECT: estimate expectation of random variables using the original
data):

   IF    relative differences are of interest,
   THEN  logarithms of the response variable y should be used.
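
A short numerical sketch may help to separate the two types of interest. The baseline value and the two groups `x` and `y` below are hypothetical and serve only to show the arithmetic: on the logarithmic scale (RATIO) halving and doubling are mirror images, on the original scale (EXPECT) a loss of 50% mirrors a gain of 50%, and the two scales can order the same groups differently.

```python
import math
from statistics import mean

baseline = 10.0                 # hypothetical reference value
halved = 0.5 * baseline         # reduction by 50%
doubled = 2.0 * baseline        # increase by 100%
plus_half = 1.5 * baseline      # increase by 50%

# RATIO: on the log scale a 50% reduction mirrors a 100% increase.
log_down = math.log(halved / baseline)
log_up = math.log(doubled / baseline)

# EXPECT: on the original scale a 50% reduction mirrors a 50% increase.
diff_down = halved - baseline
diff_up = plus_half - baseline

# Two hypothetical positive samples: x has the larger mean,
# y has the larger mean logarithm (larger geometric mean).
x = [1.0, 1.0, 1.0, 9.0]
y = [2.0, 2.0, 2.0, 2.8]
mean_x, mean_y = mean(x), mean(y)
mean_log_x = mean(math.log(v) for v in x)
mean_log_y = mean(math.log(v) for v in y)
```

Here `x` has the higher arithmetic mean but the lower mean logarithm, so a test on the logarithms answers a different question than a test on the original data. Which answer is wanted has to be fixed by the interest in the analysis before the data are inspected.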

3.  FORMAL  RELATIONS 

The examples above demonstrate that common "heuristics" are for
different reasons often inapplicable as rules for expert systems in
the area of testing hypotheses: Literature on theoretical aspects of
statistical models is typically based on assumptions that are not
realistic. Literature on applied statistics often contains misleading
or even wrong recommendations.

Evaluation of expert systems in different areas of application shows
that "heuristics" are not an economical way to represent knowledge
(see WITTKOWSKI 1986 for a more detailed discussion) but that it is
desirable to structure knowledge so that the description of an object
or relation can be inherited from the object- or relation-types it
belongs to.


Until recently, there was no unique classification of objects and
relations, although this lack of meta-knowledge has already been
recognized e.g. by MOLENAAR (1984): "If the statistical community
succeeds in producing a workable classification of all or most data
sets, a statistical expert system could be very helpful in assessing
the adequacy and robustness of some statistical techniques for the
particular data set considered."

Because statistical analysis is predetermined by the way an experiment
is planned or data are collected in a (retrospective) study, looking
for such a classification is much more promising in the field of
statistical analysis than, for instance, in the field of medical
diagnosis. The following concept for structuring knowledge was
introduced by WITTKOWSKI (1984a).

Data are typically arranged as rectangular tables, where rows
correspond to observational units (days, patients, rats etc.) and
columns to a set of variables associated with each type of
observational unit. If all those tables are joined according to the
observable relations defined by the observational units, the resulting
universal relation describes all observed relations.

The structure defined by the a-priori knowledge on the variables will
be referred to as theoretical relations, including classification of
dependent, nuisance, and independent variables, strategy of sampling,
SI-units, format of data, level and type of scale, etc. All
non-observable, but testable declarations of relevant types of
influence on dependent variables (differences in distribution,
expectation, tendency, or dispersion) will be referred to as
hypothetical relations. Requirements on the representation of the
results (e.g. tables, plots, test statistics) will be referred to as
output types. Observed relations will also be called actual relations,
while observable, theoretical and hypothetical relations will be
referred to as formal relations.

As proven in Wittkowski (1985), formal relations are sufficient for
choosing appropriate statistical methods and interpreting their
results, as far as consistency is concerned, provided that a suitable
class of statistical methods is selected. Currently, this concept has
been proven to be sufficient for linear models (analysis of variance
and covariance), tendency models (several non-parametric models based
on ranks), "semi"-parametric models (ranking after alignment),
log-linear models (analysis of contingency tables), and several
graphical and tabular techniques. It can easily be generalized e.g. to
principal component analysis, analysis of dispersion etc.
4. MULTIPLE ANALYSES

In generating hypotheses, the same set of data is typically split into
overlapping subsets and analysed with various methods to find a model
that fits best. For the reason outlined in Section 2.2 multiple
analyses of the same set of data may cause serious difficulties in
testing hypotheses. For instance, it is possible neither to try a
t-test on logarithms, a t-test on the original data, a U-test and a
chi-square test for the same set of data nor to try out both the
paired and unpaired design prior to deciding which result is to be
published. It is obvious that the probability of an error of the first
kind will be much higher than the p-value of the result chosen on the
basis of this "principle of most significance". In the terminology of
Section 3, confirmatory analyses require that observable and
hypothetical relations must not be modified during analysis.

Therefore, an expert system for testing hypotheses should know not
only the original (conceptual) theoretical relations but also the
original (conceptual) observable and hypothetical relations. This
knowledge on formal relations will be referred to as conceptual
problem type. Based on this knowledge, the system can decide which
modifications of the model are allowed in the process of defining a
derived external problem type (e.g. defining a projection on the data
set for uni- or bi-variate statistics, defining a restriction for
simple main effects and a-posteriori multiple comparisons) and which
are not (e.g. testing both the original and transformed data,
including terms in the conceptual model equation or excluding terms
from it). For the special cases given above (cf. WITTKOWSKI 1985) the
system can even compute the adjustment necessary to derive the
"global" p-value from the "local" p-value given by the method.

Based on this new concept of rating and structuring knowledge on
statistical concepts, problems, and methods, selection of appropriate
methods can be treated as a special pattern recognition process (see
WITTKOWSKI 1986 for details), which consists of

1) representing problems and methods based on data-independent
   relations (conceptual and implicit problem types, respectively),

2) choosing sub-designs by projecting and restricting the conceptual
   problem type,

3) normalizing the external problem type and selecting a method with a
   corresponding implicit problem type, and

4) verifying assumptions of the method on the data.


5. THE USER INTERFACE

During this pattern recognition process knowledge acquisition and
knowledge application can be supported at several stages:

 - Acquisition of knowledge (from experts in applied and theoretical
   statistics) on conceptual and implicit problem types can be
   facilitated by fast dialogue procedures and be verified by testing
   its consistency (not shown in this paper).

 - By deduction from the conceptual problem type the amount of input
   necessary to define a sub-problem is reduced, inconsistencies in
   the non-deducible information are explained or, alternatively,
   pop-up menus containing only consistent alternatives are presented,
   and the user is given hints for interpretation (see the example
   below). The set of necessary parameters may be explained by means
   of intelligent tutoring.

 - The expert system automatically chooses and calls an appropriate
   statistical method with a corresponding implicit problem type.

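
The last stage, matching a problem description against the implicit problem types of the available methods, can be sketched as a simple lookup. This is a minimal illustration under stated assumptions, not the system described in the paper: the class `Method`, the function `select_methods`, and the three-entry knowledge base are hypothetical, and each implicit problem type is reduced to two data-independent attributes, the type of influence of interest and the admissible scale levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    interest: str       # type of influence the method is consistent against
    scales: frozenset   # scale levels for which the method is meaningful


# Hypothetical and much simplified knowledge base of implicit problem types.
METHODS = (
    Method("t-test / analysis of variance", "expectation",
           frozenset({"interval"})),
    Method("rank test (U-test)", "tendency",
           frozenset({"ordinal", "interval"})),
    Method("chi-square test", "distribution",
           frozenset({"nominal", "ordinal"})),
)


def select_methods(scale, interest):
    """Return the methods whose implicit problem type matches the problem.

    Only formal, data-independent information is used: the scale level
    restricts the candidates, the declared interest determines the choice.
    """
    return [m.name for m in METHODS
            if m.interest == interest and scale in m.scales]


ordinal_tendency = select_methods("ordinal", "tendency")
ordinal_distribution = select_methods("ordinal", "distribution")
nominal_expectation = select_methods("nominal", "expectation")
```

For an ordinal variable the lookup returns the rank test when tendency is of interest and the chi-square test when any difference in distribution is of interest. A request for expectations on a nominal scale matches nothing, which is the point at which a system of this kind could explain the inconsistency to the user.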
Consider, for example, an experiment where each of 10 patients is
given three doses of a medication (BETADOSE) subsequent to two
different techniques of operation (OPERTYPE) applied after cardiac
infarction. Note that patients are nested within factor OPERTYPE.
Bodyweight (BODYWGHT) was measured at entry in the study, vigour
(ERGOMETR) for each dose. Suppose that the goal of the study was to
measure the effect on ERGOMETR in terms of expectation and that a
linear relation between BODYWGHT and ERGOMETR is assumed. Input of
"ERGOMETR" would result in the following mask on the display:


   NAM            MIN    MAX    I/O

   OPERTYPE (3)   (1)           EXPECT
   BETADOSE       (1)           EXPECT
   BODYWGHT       (2)           LINEAR
   ERGOMETR       (2)           TEST

Without a modification the system automatically calls an analysis of
covariance. If OPERTYPE or BETADOSE is restricted to one or two
categories (1), the necessary modifications of the analysis system's
(BMDP, P-STAT, SAS, SPSS, etc.) control language are generated and the
user is given the information how to compute a global p-value from the
local p-value given by the analysis system.
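
Why a global p-value is needed can be shown by simulation. The sketch below rests on several assumptions that are not taken from the paper: both groups are drawn from the same log-normal distribution, three analyses are tried on each data set (a large-sample test on the original data, the same test on logarithms, and a U-test, all with normal approximations), and a Bonferroni correction stands in for the adjustment, which the paper does not specify.

```python
import math
import random


def z_p_value(a, b):
    """Two-sided large-sample p-value for a difference in means."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2.0))


def u_p_value(a, b):
    """Two-sided U-test p-value (normal approximation, assumes no ties)."""
    na, nb = len(a), len(b)
    u = sum(1 for x in a for y in b if x > y)
    z = (u - na * nb / 2.0) / math.sqrt(na * nb * (na + nb + 1) / 12.0)
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(reps=2000, n=30, alpha=0.05, seed=1):
    """Rejection rates under the null hypothesis of no group difference."""
    rng = random.Random(seed)
    single = picked = adjusted = 0
    for _ in range(reps):
        a = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        b = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        p_values = [
            z_p_value(a, b),
            z_p_value([math.log(x) for x in a], [math.log(x) for x in b]),
            u_p_value(a, b),
        ]
        local_p = min(p_values)                         # "most significant"
        global_p = min(1.0, len(p_values) * local_p)    # Bonferroni (assumed)
        single += p_values[0] < alpha
        picked += local_p < alpha
        adjusted += global_p < alpha
    return single / reps, picked / reps, adjusted / reps


single_rate, picked_rate, adjusted_rate = simulate()
```

Reporting the smallest of the three local p-values can only raise the rejection rate relative to any single pre-specified analysis, and the adjusted global p-value can only lower it again. Bonferroni is conservative when the analyses are strongly correlated, as they are here, so an adjustment tailored to the specific design would reject more often while still respecting alpha.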

If BODYWGHT or ERGOMETR are restricted (2), the influence types are
modified (e.g. LINEAR also for BETADOSE), or a variable PATINGRP with
categories 1-5 is introduced (3), the appropriate methods are called
as well, but the user is given the information that the result has to
be interpreted as being exploratory.
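
The distinction drawn in this example can be written down as a small decision function. The sketch is an interpretation of the three cases marked (1), (2), and (3) and not code from the system described: the dictionary `CONCEPTUAL`, the role labels, the modification names, and the function `classify` are all hypothetical.

```python
# Hypothetical encoding of the conceptual problem type of the example.
# The role labels are assumptions made for this sketch.
CONCEPTUAL = {
    "OPERTYPE": {"role": "independent", "influence": "EXPECT"},
    "BETADOSE": {"role": "independent", "influence": "EXPECT"},
    "BODYWGHT": {"role": "covariate", "influence": "LINEAR"},
    "ERGOMETR": {"role": "dependent", "influence": "TEST"},
}


def classify(kind, variable=None):
    """Classify a modification of the conceptual problem type.

    kind is one of "none", "restrict", "change_influence", "add_variable".
    """
    if kind == "none":
        return "confirmatory"
    if variable not in CONCEPTUAL:
        # Case (3): a variable outside the conceptual problem type.
        return "exploratory"
    if kind == "restrict" and CONCEPTUAL[variable]["role"] == "independent":
        # Case (1): restricting a factor to some of its categories.
        return "confirmatory, global p-value required"
    # Case (2): restricting other variables or changing influence types.
    return "exploratory"


unmodified = classify("none")
restricted_factor = classify("restrict", "OPERTYPE")
restricted_response = classify("restrict", "ERGOMETR")
changed_influence = classify("change_influence", "BETADOSE")
new_variable = classify("add_variable", "PATINGRP")
```

Only the restriction of a factor keeps the analysis confirmatory, and then only together with a global p-value. Every other modification changes the observable or hypothetical relations and is therefore reported as exploratory.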


6.  CONCLUSIONS 

The first statistical expert systems were designed for applications in
the field of generating hypotheses. In the field of testing
hypotheses, new concepts for knowledge representation are proven to be
necessary. A formal description of interest in the result of an
analysis (hypothetical relations) is proposed and demonstrated to be
sufficient for a wide range of methods.

This expert system approach significantly reduces the amount of
information to be entered during analysis of sub-problems so that fast
dialogue procedures can facilitate access to analysis systems. Because
the user formulates questions (restrictions of projections on the data
set and hypothetical relations) instead of choosing some test
procedures, this concept leads to fewer erroneous applications of
statistical methods. Knowledge on conceptual and external problems can
also be used to explain results ("significances") to the experimenter
and thus help him to interpret these results in terms of hypothetical
relations. Automatic adjustment of p-levels in multiple analyses of
the same data set reduces the frequency of misinterpreting
"significant" results.

The distinction of formal and actual relations not only simplifies
access to the data and interpretation of results but is also essential
to the concept of knowledge engineering (WITTKOWSKI 1986): Actual
integrity constraints can be checked either by data base management
systems or inside the methods, while formal integrity constraints can
be checked by the expert system. Because access to the data is not
necessary to compare the definition of an (external) problem with the
knowledge bases, data and knowledge may be stored on distant computer
systems (cf. WITTKOWSKI 1985).

Although this concept was originally developed for applications in the
field of testing hypotheses, it also has implications for generating
hypotheses: Criteria for selection of methods in the search for a
model that fits the data can be based on the knowledge on the
theoretical relations. The description of implicit problem types
associated with each method can be used to help interpret models in
the area of application.

The following consequences are related to more insight into
statistical methods: Common concepts underlying these methods will
become explicitly defined, so that they can be discussed among
experts, and some "heuristics" can be replaced by deterministic rules.
This new insight into meta-knowledge will allow one to concentrate on
relatively few concepts instead of on various methods so that
unnecessary technical details can be omitted in teaching statistics.

AN EVALUATION OF TIME SERIES ANALYSIS PROGRAMS AVAILABLE IN THREE MAJOR
STATISTICAL COMPUTER PACKAGES

Terry J. Woodfield, Arizona State University


1.  INTRODUCTION 

Three popular statistical software packages,
BMDP, SAS, and SPSS-X, have implemented programs
to perform what is commonly referred to as
Box-Jenkins Time Series Modeling. Many other
software products exist that perform calculations
related to time series modeling, such as RATS and
SCA, but it appears that BMDP, SAS, and SPSS-X are
packages likely to be encountered in an academic
computing environment. In this paper, we
consider the time series features of BMDP, SAS,
and SPSS-X and compare the packages using
simulated data. The results of a small Monte
Carlo study comparing three parameter estimation
techniques are also presented.

The  implementations  we  will  examine  involve 
batch  processing  on  an  IBM  3081  mainframe 
computer  at  Arizona  State  University.  While 
interactive  processing  is  preferred  for  many 
applications,  it  typically  requires  greater 
overhead  and  is  not  usually  feasible  for  large 
academic  computing  systems.  The  new  generation 
of supermicros and minis will clearly alter
this situation, and in fact has already had an
impact on statistical software vendors, as is
evident from the availability of versions of
BMDP, SAS, and SPSS-X for microcomputers. Large
data  sets  and  intensive  computational  overhead 
make  time  series  analysis  more  appropriate  for 
larger,  faster  computers.  However,  rapid  changes 
are  occurring  in  both  hardware  and  software.  Our 
evaluations  must  be  judged  in  that  context. 

There  are  many  criteria  that  one  could  use  to 
evaluate  packages.  This  paper  will  consider  how 
flexible,  up-to-date,  and  accurate  the  three 
packages  are  with  respect  to  time  series 
analysis.  Here  flexibility  may  refer  to  two 
aspects: how easily output may be manipulated to
produce  displays  and  further  analyses,  and  how 
many  methods  of  analysis  are  available.  All 
packages  appear  to  be  accurate  within  the 
limitations  of  the  floating  point  arithmetic  used 
and  the  algorithms  employed.  Accuracy  with 
respect  to  the  estimation  algorithm  employed  will 
be  emphasized  in  this  work. 

2. TIME SERIES MODELING

Most packages use Box and Jenkins (1976) as
the primary reference for univariate time series
modeling. Judge et al. (1985) provide a
useful summary of the available theory related to
univariate and multivariate time series modeling.
The so-called Box-Jenkins modeling strategy is
incorporated into the design of the programs of
BMDP, SAS, and SPSS-X. This strategy is
summarized as: identification → estimation →
diagnostic checking. After a model passes the
diagnostic checking step, one may produce
forecasts using the model.

For convenience, we will restrict attention to
autoregressive integrated moving average (ARIMA)
models that incorporate a transfer function
component. The general form of the model is

$$\phi(B)\Bigl[Y_t - \mu_t - \sum_{i=1}^{k} \nu_i(B)\,X_{it}\Bigr] = \theta(B)\,\epsilon_t, \qquad (2.1)$$

where $B$ is the backshift operator defined by
$B^k Y_t = Y_{t-k}$, $Y_t$ is the original series or a
transformation of the original series, $\mu_t$ is a
mean or trend term, the polynomial
$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_r B^r$ may be the product of
stationary seasonal and non-seasonal
autoregressive components and nonstationary
differencing components, the polynomial
$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_s B^s$ may be the product of
invertible seasonal and non-seasonal moving
average components, $\epsilon_t$ is an independent Gaussian
white noise process, $X_{1t}, \ldots, X_{kt}$ are exogenous
variables, and the transfer functions
$\nu_i(B) = \omega_i(B)/\delta_i(B)$ are ratios of polynomials of
possibly varying orders, in general given by
$\omega(B) = \omega_0 - \omega_1 B - \omega_2 B^2 - \cdots$, $\delta(B) = 1 - \delta_1 B - \delta_2 B^2 - \cdots$. See
Box and Tiao (1975) for a complete description of
transfer function models with ARIMA errors.

Given realizations $y_1, y_2, \ldots, y_n$ of a time
series $Y_t$, one may estimate the parameters of the
model (2.1) using nonlinear least squares or
exact maximum likelihood. Kohn and Ansley (1985)
provide one of the most recent algorithms for
evaluating the likelihood function. Ansley and
Newbold (1980) compare the unconditional least
squares, conditional least squares, and maximum
likelihood techniques for parameter estimation.
The algorithm of Kohn and Ansley appears to be
the best available method for parameter
estimation based on numerical and statistical
criteria. Newton (1981) provides a useful
discussion of available estimation techniques.
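
To make the conditional least squares idea concrete, the following is a minimal sketch for a first-order moving average model written as y_t = e_t - theta*e_{t-1}. It assumes the pre-sample shock e_0 is set to zero and uses a plain grid search over the invertible region; the function names and the grid search are illustrative assumptions and are not the algorithms used by BMDP, SAS, or SPSS-X.

```python
def conditional_residuals(y, theta):
    """Residuals of y_t = e_t - theta*e_{t-1}, conditioning on e_0 = 0."""
    residuals = []
    previous = 0.0  # assumed pre-sample shock
    for value in y:
        current = value + theta * previous
        residuals.append(current)
        previous = current
    return residuals


def conditional_sum_of_squares(y, theta):
    """Objective minimized by conditional least squares."""
    return sum(e * e for e in conditional_residuals(y, theta))


def cls_estimate_ma1(y, steps=1999):
    """Grid search for theta over the invertible region -1 < theta < 1."""
    grid = [-1.0 + 2.0 * (i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda theta: conditional_sum_of_squares(y, theta))
```

Unconditional least squares and exact maximum likelihood differ from this sketch in how they treat the unknown pre-sample values, which is where the methods compared by Ansley and Newbold (1980) part ways.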

The  identification  stage  of  the  modeling 
process  relies  on  subjective  examination  of 
sample  functions.  Many  sources  restrict 
attention  to  the  sample  autocorrelations  and 
partial  autocorrelations.  Frequency  domain 
quantities  such  as  periodograms  and  sample 
spectral densities may also be employed.
Transfer  function  models  are  identified  using 
cross-correlations  and  cross-spectral  densities. 
Recent  diagnostic  tools  include  the  shifted 
S-array  of  Woodward  and  Gray  (1981),  objective 
order  determining  criteria  such  as  AIC  (Akaike 
1974)  and  CAT  (Parzen  1977),  and  canonical 
correlation  diagnostics  suggested  by  Priestley, 
Rao,  and  Tong  (1974),  Akaike  (1976),  and  Tsay  and 
Tiao  (1985). 
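
The two sample functions used most often at the identification stage can be sketched as follows. The sample autocorrelation uses the usual divisor n, and the partial autocorrelations are obtained from the autocorrelations by the Durbin-Levinson recursion; both choices are assumptions for illustration, since the packages do not document their exact formulas here.

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag (divisor n, mean removed)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c0 = sum(d * d for d in dev) / n
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / n / c0
            for k in range(max_lag + 1)]


def sample_pacf(acf):
    """Partial autocorrelations at lags 1..K from acf[0..K] (acf[0] == 1)."""
    pacf = []
    phi_prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, len(acf)):
        num = acf[k] - sum(phi_prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf
```

For a pure autoregressive process of order p the theoretical partial autocorrelations vanish beyond lag p, which is the pattern the analyst looks for in the sample version.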

3.  DESCRIPTION  OF  THE  PACKAGES 

We will address some of the basic features of
each package in this section. The next section
will provide comments and comparisons related to
univariate analysis of a time series.

BMDP has two programs for time series
analysis: BMDP1T and BMDP2T. BMDP1T performs
univariate and bivariate spectral analysis.
BMDP2T performs Box-Jenkins time series analysis
including transfer function models. BMDP
provides a basic user's manual and numerous
technical reports. The documentation provided by
the basic manual is adequate for an experienced
time series analyst but probably difficult for
the student or novice. BMDP anticipates many of
the needs of time series analysts and has
implemented techniques that are current up to
about 1962.

BMDP has the most primitive data handling
capabilities of the three packages. The
TRANSFORM paragraph is used to transform response
or predictor variables or to create indicator
variables. No date/time functions are provided
to label the time frame of the data. BMDP
provides few features for saving or merging data
sets. For example, producing residuals from a
regression analysis using BMDP1R and analyzing
them using BMDP2T is more difficult than using
comparable procedures in SAS or SPSS-X.

BMDP1T is the most comprehensive frequency
domain implementation of the three packages. It
calculates periodograms, cross-periodograms, and
smoothed or filtered versions of these. BMDP1T
has options for replacing missing values of a
series. Parametric spectral estimation is
available using autoregressive filters.

BMDP2T is a complete implementation of
Box-Jenkins ARIMA modeling that includes
capabilities for handling transfer function
models. No multivariate capabilities beyond the
inclusion of multiple transfer functions are
available.

SAS provides a library of procedures in the
SAS/ETS product that perform a wide variety of
time domain computations. SAS provides numerous
manuals, instructional texts, and technical
reports that document its features. The SAS/ETS
User's Guide is the basic manual for the SAS/ETS
product. The documentation in the SAS/ETS guide
is adequate for experienced time series analysts
but perhaps somewhat confusing to the novice.
The documentation is more comprehensive and easier
to follow than the comparable BMDP product. SAS
anticipates many of the needs of time series
analysts and has implemented techniques that are
current up to about 1983.

SAS has the most advanced data management
features of the three packages. It has a large
number of built-in date/time functions for
labeling the time frame of the data. Variable
transformations are handled by the DATA step. In
fact, the wide range of built-in mathematical and
statistical functions makes it possible to use the
DATA step to program simple applications such as
multiplicative decomposition seasonal adjustment.

SAS/ETS has a number of procedures to perform
univariate and multivariate forecasting. These
include ARIMA, AUTOREG, FORECAST, and STATESPACE.
Other linear and nonlinear modeling procedures
are also available. SPECTRA calculates
periodograms, cross-periodograms, and smoothed
versions of these. Kernel estimates and
parametric estimates of a spectral density are
not provided. No filtering other than moving
average filtering is available.

SPSS-X has one procedure, BOX-JENKINS, for
performing univariate time series computations.
SPSS-X provides numerous manuals and instructional
texts that document its features. The SPSS-X
User's Guide is the basic manual for users of
SPSS-X. The documentation in the SPSS-X guide is
adequate for experienced time series analysts but
perhaps somewhat confusing to the novice. The
documentation is more comprehensive and easier to
follow than the comparable BMDP product
(univariate modeling only), but inferior to the
corresponding SAS product. SPSS-X anticipates
many of the needs of time series analysts who use
only Box-Jenkins univariate ARIMA models and has
implemented techniques that are current up to
about 1976.

SPSS-X has advanced data management features
comparable to earlier versions of SAS. It has a
large number of built-in date/time functions for
labeling the time frame of the data. Variable
transformations are handled by the COMPUTE
command. Like SAS, SPSS-X has a wide range of
built-in mathematical and statistical functions.

SPSS-X has no multivariate time series
capabilities and no frequency domain
capabilities. Univariate transfer function
modeling is not available from SPSS-X. The
BOX-JENKINS procedure implements the basic ARIMA
modeling strategy described in Box and Jenkins
(1976).

4.  UNIVARIATE  IMPLEMENTATION  OF  BOX-JENKINS 
TIME  SERIES  MODELING 

Consider model (2.1) without the transfer
function component. To identify the
nature of the polynomials $\phi(B)$ and $\theta(B)$, BMDP2T,
SAS PROC ARIMA, and SPSS-X procedure BOX-JENKINS
all provide the sample autocorrelation and
partial autocorrelation functions. For
identifying seasonal components, BMDP and SAS
permit calculation of the periodogram and various
smoothed versions of the periodogram.

The display of the sample autocorrelations and
partial autocorrelations is very similar for all
three packages. None of the packages permits
alternate forms of plotting the sample
autocorrelations, and none allows the
autocorrelations to be saved for later use.
Consequently, more desirable pen plotter versions
cannot be obtained, even though SAS and SPSS-X
have advanced routines for accessing
sophisticated plotting hardware. Thus,
publication quality plots of the sample
autocorrelations must be obtained using other
resources. One approach is to read the
appropriate output page using a program written
in a lower level language like C or FORTRAN that
reconstructs the desired sample function, and
then use SAS to read in the reconstructed
function and plot it using SAS/GRAPH.

If the identification stage reveals that the
series is nonstationary, various transformations
or differencing operations may be performed. All
packages have the ability to perform
transformations "outside of" the procedures used
to carry out the analysis. In addition, SPSS-X
procedure BOX-JENKINS permits logarithmic or
power transformations as options within the
procedure. The advantage of the SPSS-X
implementation is that the user need not worry
about untransforming the forecasts. All programs
allow differencing of the input series to be
performed within the procedure being used.

When a model has been identified, the
parameters may be estimated using conditional
least squares (CLS), unconditional least squares
(ULS, or the backforecasting approach described
in Box and Jenkins, 1976), or maximum likelihood
(ML). Neither BMDP2T nor SPSS-X BOX-JENKINS
provides maximum likelihood estimates. SAS PROC
ARIMA uses one of the more efficient algorithms
for obtaining ML estimates. CLS and ULS
estimates are obtained using a nonlinear least
squares algorithm. BMDP2T and SAS PROC ARIMA
seem to use a more efficient algorithm
(Gauss-Marquardt) for nonlinear least squares
than does SPSS-X procedure BOX-JENKINS (pattern
search), although numerical results using the
same estimation technique are usually in close
agreement for the three packages.

SAS PROC ARIMA has a unique feature that
allows alternate parameterization for a transfer
function model. ARIMA also produces an estimate
of $\mu_t = \mu$ labeled MU, and for models with AR
components, ARIMA provides an estimate labeled
CONSTANT, which is the stable mean typically
called $\theta_0$ by some authors and defined by
$\theta_0 = \mu\,(1 - \sum_k \phi_k)$. BMDP and SPSS-X provide only the
estimate of $\theta_0$. When no AR components are
present, $\mu = \theta_0$.
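
As a worked illustration of the relation between the two estimates, consider a first-order autoregressive model with series mean $\mu$ and coefficient $\phi_1$. The numerical values below are hypothetical and chosen only to show the arithmetic. Since the backshift operator leaves a constant unchanged,

$$(1 - \phi_1 B)(Y_t - \mu) = \epsilon_t \quad\Longrightarrow\quad Y_t = \mu\,(1 - \phi_1) + \phi_1 Y_{t-1} + \epsilon_t ,$$

so the constant term is $\theta_0 = \mu\,(1 - \phi_1)$. With $\mu = 10$ and $\phi_1 = 0.95$ this gives $\theta_0 = 10 \times 0.05 = 0.5$. A package reporting only the constant would print 0.5, while one reporting the mean would print 10, even though both describe the same fitted model.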

The three packages have similar options for
obtaining forecasts. Options include specifying
at what time point the forecasts should begin and
how many time points beyond the end of the series
the forecasts should extend. BMDP2T does not
display confidence intervals for the forecasts,
but does provide standard errors that may be used
to compute such intervals. SAS PROC ARIMA
provides 95% confidence intervals. SPSS-X
BOX-JENKINS provides confidence intervals and
allows the user to specify the desired confidence
coefficient. Note that such confidence intervals
are usually valid only when derived using a long
series of data, i.e., the intervals are based on
asymptotic theory and not on exact distribution
theory.
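
When only forecast standard errors are printed, an approximate interval can be computed by hand. The sketch below assumes the usual large-sample normal approximation (forecast plus or minus a normal quantile times the standard error); the function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def forecast_interval(forecast, std_error, confidence=0.95):
    """Asymptotic normal-theory interval for a single forecast."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    half_width = z * std_error
    return forecast - half_width, forecast + half_width


# Hypothetical forecast of 100 with standard error 2.
lower, upper = forecast_interval(100.0, 2.0)
```

Changing the `confidence` argument mimics the user-specified confidence coefficient; the caveat about short series applies equally to intervals computed this way.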

All three packages display forecasts and
confidence intervals and/or standard errors in a
column listing. SPSS-X BOX-JENKINS also displays
forecasts in a table similar to that found in Box
and Jenkins (1976, Table 5.2, page 136). In
addition, SAS PROC ARIMA allows the forecasts and
related statistics to be saved in an output data
set.

All plotting of sample functions, forecasts,
and related statistics is "internal" and beyond
the control of the user in BMDP2T and SPSS-X
BOX-JENKINS. In contrast, while SAS PROC ARIMA
plots sample autocorrelations and partial
autocorrelations (and inverse autocorrelations),
the PLOT and GPLOT procedures are employed in SAS
to customize desired plots of original data,
forecasts, and confidence intervals. Thus, SAS
seems better suited for time series
analyses that are to be published.

Few modern diagnostic tools are made available
by the three packages. SAS PROC ARIMA produces
the inverse autocorrelation function in the
IDENTIFY step. PROC ARIMA also provides the
value of Akaike's AIC criterion for a fitted
model. BMDP2T, SAS PROC ARIMA, and SPSS-X
procedure BOX-JENKINS provide a chi-square test
of the residuals for white noise. All three
packages produce t-ratios for estimated
parameters and sample autocorrelations for the
residuals. Again, note that the t-ratios
correspond to asymptotic theory, since no exact
distribution theory is known. For this reason,
the packages wisely refuse to print p-values
whose interpretation would be questionable for
smaller series.
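
The chi-square test of the residuals for white noise mentioned above is a portmanteau test. The text does not say which statistic each package computes, so the sketch below shows two common forms as assumptions: the Box-Pierce statistic and the Ljung-Box modification, both built from residual autocorrelations at lags 1 to K and referred to a chi-square distribution with K minus the number of estimated ARMA parameters degrees of freedom.

```python
def box_pierce(residual_acf, n):
    """Q = n * sum of squared residual autocorrelations (lags 1..K)."""
    return n * sum(r * r for r in residual_acf)


def ljung_box(residual_acf, n):
    """Q* = n(n+2) * sum of r_k^2 / (n - k) over lags k = 1..K."""
    return n * (n + 2) * sum(r * r / (n - k)
                             for k, r in enumerate(residual_acf, start=1))


def portmanteau_df(n_lags, n_arma_params):
    """Degrees of freedom for the reference chi-square distribution."""
    return n_lags - n_arma_params
```

Because a single composite statistic averages over many lags, a large autocorrelation at one lag can be masked, which is one reason to inspect the individual residual autocorrelations as well.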

All packages also report the residual variance
and other statistics related to the original
series and the residuals. The terminology used
by SAS, e.g., "VARIANCE ESTIMATE" instead of
"RESIDUAL VARIANCE", may be confusing to some.
All packages require some investigation to
determine the divisor employed to obtain the
residual variance. The divisor appears to be the
degrees of freedom formed by subtracting degrees
of differencing and number of parameters
estimated from the total series length.
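
The divisor described above can be written out as a short calculation. This is only a sketch of the rule as the text infers it ("appears to be"), not a documented formula of any package; the function and argument names are hypothetical.

```python
def residual_variance(residual_ss, series_length, n_differences, n_params):
    """Residual sum of squares divided by the inferred degrees of freedom."""
    df = series_length - n_differences - n_params
    return residual_ss / df


# Hypothetical fit: 100 observations, one difference, three parameters.
variance = residual_variance(96.0, 100, 1, 3)
```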

None of the packages has options within the
procedures or programs to carry out a complete
residual analysis. Reliance on a composite
chi-square test for white noise is inadequate for
many situations. However, SAS has the greatest
flexibility in retaining the residuals and
carrying out a residual analysis using other SAS
procedures, such as PROC MEANS, PROC UNIVARIATE,
and PROC PLOT. With a little more effort, SPSS-X
can achieve similar results, while BMDP requires
much greater effort to carry out a complete
residual analysis.

Finally, note that SAS PROC AUTOREG and SAS
PROC FORECAST have some features that may be
useful in Box-Jenkins modeling. Primarily, these
features are designed for purely autoregressive
processes.

Display  1  summarizes  the  features  of  each 
package. 

5.  EXAMPLES  OF  UNIVARIATE  TIME  SERIES  MODELING 

Three models and two series lengths (n=50 and
n=100) were used to simulate data in order to
compare the estimation algorithms of the three
packages. The three models employed were:

(1) $y_t - 0.95\,y_{t-1} = \epsilon_t$;

(2) $y_t = \epsilon_t - 0.95\,\epsilon_{t-1}$;

(3) $y_t - 1.5\,y_{t-1} + 1.21\,y_{t-2} - 0.455\,y_{t-3} = \epsilon_t + 0.2\,\epsilon_{t-1} + 0.9\,\epsilon_{t-2}$.

Model (3) was suggested by Woodward and Gray
(1981).
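
The paper does not describe how the series were generated. As a minimal sketch, the three models can be simulated with a direct recursion, assuming unit-variance Gaussian shocks, zero starting values, and a discarded burn-in period (all three are assumptions, as are the function name and the seed).

```python
import random


def simulate_arma(ar, ma, n, burn_in=200, seed=0):
    """Simulate y_t = sum(ar[i]*y_{t-1-i}) + e_t + sum(ma[j]*e_{t-1-j})."""
    rng = random.Random(seed)
    y, e = [], []
    for t in range(n + burn_in):
        e_t = rng.gauss(0.0, 1.0)
        value = e_t
        value += sum(a * y[t - 1 - i] for i, a in enumerate(ar) if t - 1 - i >= 0)
        value += sum(m * e[t - 1 - j] for j, m in enumerate(ma) if t - 1 - j >= 0)
        y.append(value)
        e.append(e_t)
    return y[burn_in:]  # drop the start-up transient


model1 = simulate_arma([0.95], [], 50)                      # model (1)
model2 = simulate_arma([], [-0.95], 50)                     # model (2)
model3 = simulate_arma([1.5, -1.21, 0.455], [0.2, 0.9], 50)  # model (3)
```

Note the sign convention: coefficients are entered as they appear after moving all lagged terms to the right-hand side of each model equation.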

Initially, one series was generated for each
model and sample size. Results were different
across all packages, although in many cases
results were very similar. The differences can
be attributed to the following differences in
implementation:

1. Default values for convergence of parameter
estimates differ. SAS and SPSS-X use a
stopping criterion of 0.001, while BMDP uses
0.0001.

2. The stopping rules vary across packages,
e.g., BMDP has a stopping rule related to the
relative change in the residual sum of
squares, while SAS and SPSS-X appear to use
only the relative difference in consecutive
parameter estimates.

3. The three packages use different numerical
optimization routines.

4. Only SAS restricts estimates by default to
fall within the stationary or invertible
region.

5. SAS appears to use double precision for all
computations, while SPSS-X uses single
precision for some computations. BMDP does
all computations in single precision.
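
Items 1 and 2 can be illustrated with two stopping rules that disagree on the same iteration. The exact definitions used by the packages are not given in the text, so the forms below (relative change per parameter, relative change in the residual sum of squares) are assumptions; only the tolerance values 0.001 and 0.0001 come from the list above.

```python
def converged_by_parameters(old, new, tol=0.001):
    """Stop when every parameter changes by at most tol in relative terms."""
    return all(abs(b - a) <= tol * max(abs(a), 1e-12) for a, b in zip(old, new))


def converged_by_rss(old_rss, new_rss, tol=0.0001):
    """Stop when the residual sum of squares changes by at most tol (relative)."""
    return abs(old_rss - new_rss) <= tol * old_rss


# Hypothetical iteration: the estimate moves from 0.900 to 0.902 while the
# residual sum of squares barely changes on a flat part of the surface.
stop_on_parameters = converged_by_parameters([0.900], [0.902])
stop_on_rss = converged_by_rss(50.000, 49.999)
```

In this hypothetical step the sum-of-squares rule stops while the parameter rule continues, so two programs minimizing the same function can report different estimates.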

While one can use certain options to control
some of the above factors, it appears that the
packages cannot be made to provide identical
results. SAS PROC ARIMA can eliminate the
stationary and invertible restriction. SPSS-X and
BMDP do not have an option to force stationarity
or invertibility. All packages allow one to fine
tune convergence criteria, and all packages allow
control of starting values for estimation.
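
The effect of arithmetic precision (item 5 of the list above) is one factor that no option can remove. The following sketch emulates single precision accumulation in Python by rounding every intermediate result to an IEEE single; it illustrates the general phenomenon and does not reproduce any package's internal arithmetic.

```python
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, single=False):
    """Running sum, optionally rounding each partial sum to single precision."""
    total = 0.0
    for v in values:
        if single:
            total = to_single(total + to_single(v))
        else:
            total += v
    return total


values = [0.1] * 100000          # exact sum would be 10000
double_error = abs(accumulate(values) - 10000.0)
single_error = abs(accumulate(values, single=True) - 10000.0)
```

Sums of squares over long series accumulate rounding error in the same way, so small discrepancies between packages are to be expected even when the algorithms agree.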

We  decided  that  simulations  generated  to 
compare  the  packages  might  unfairly  favor  the 
capabilities  of  one  or  more  packages.  For 
example,  the  first  two  models  considered  have 
roots  near  the  unit  circle.  The  ULS  method 
appears  to  perform  better  for  models  with  roots 
near  the  unit  circle,  while  otherwise  it  appears 
to  be  inferior  to  the  CLS  and  ML  methods  (Ansley 
and  Newbold  1980).  Thus,  the  package  with  the 
"best"  implementation  of  ULS  might  appear 
superior  to  the  others  even  if  it  had  an  inferior 
implementation  of  the  other  techniques. 
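
How near a root lies to the unit circle can be checked directly for low-order polynomials. The sketch below handles first- and second-order polynomials in the backshift operator only; the helper names are hypothetical, and a higher-order polynomial would need a general root finder.

```python
import cmath


def linear_root(c0, c1):
    """Root of c0 + c1*B = 0."""
    return -c0 / c1


def quadratic_roots(c0, c1, c2):
    """Roots of c0 + c1*B + c2*B**2 = 0, possibly complex."""
    disc = cmath.sqrt(c1 * c1 - 4.0 * c2 * c0)
    return (-c1 + disc) / (2.0 * c2), (-c1 - disc) / (2.0 * c2)


def min_modulus(roots):
    """Smallest modulus; values just above 1 mean a root near the unit circle."""
    return min(abs(r) for r in roots)


model1_ar = min_modulus([linear_root(1.0, -0.95)])        # 1 - 0.95B
model2_ma = min_modulus([linear_root(1.0, -0.95)])        # 1 - 0.95B
model3_ma = min_modulus(quadratic_roots(1.0, 0.2, 0.9))   # 1 + 0.2B + 0.9B^2
```

Stationarity of an autoregressive polynomial and invertibility of a moving average polynomial both require every root to have modulus greater than one.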

Display 2 gives some results using n=50 for
all three models. Note that in some cases
results are quite different and that choice of
package and estimation technique may be very
important. Unfortunately, no guidelines are
evident other than those suggested by Ansley and
Newbold (1980), namely to use maximum likelihood
if possible over ULS and CLS. Since neither BMDP
nor SPSS-X provides ML estimates, this advice
favors SAS.

A  comprehensive  study  is  warranted  to 
determine  what  set  of  estimation  algorithms  and 
default  tuning  parameters  may  be  preferred  for 
given  types  of  models.  Experience  suggests  that 
no  given  technique  or  set  of  defaults  will  be  a 
clear  winner. 

In order to gain some insight into the
validity of the Ansley and Newbold (1980)
findings, we ran a small simulation study using
five replicates for each model/sample size
combination. We used SAS PROC ARIMA with its
default settings to compare the three estimation
techniques. The design was a 3 by 2 factorial
repeated measures experiment with response
variable MSE defined by

$$\mathrm{MSE} = \sum\,(\text{parameter} - \text{estimate})^2 \,/\, (\text{no. parameters}).$$

We define N as the sample size classification
variable and MODEL as the classification variable
identifying the model used to simulate the data.
The repeated measure was MSE taken over the three
techniques, CLS, ULS, and ML, for each data set
(experimental unit). The analysis clearly
revealed the presence of interaction between
MODEL and N. Univariate corrected F-tests
indicated that the MODEL means were significantly
different for all three repeated measures. In
addition, N and MODEL*N means were significantly
different for the CLS and ML methods.
Statistical significance was judged at the 10%
level.

Given the presence of interaction, we proceed
to examine the value of MSE for each technique
averaged over MODEL, N, and MODEL*N to gain
insight into the behavior of the techniques.
Display 3 provides summary tables giving the
relevant averages. Also given are averages using
MAD rather than MSE, where MAD is defined by

$$\mathrm{MAD} = \sum\,|\text{parameter} - \text{estimate}| \,/\, (\text{no. parameters}).$$

Note the cases where MSE and MAD provide
different orderings of the average values.
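
The two response measures are easy to compute, and a small constructed case shows how they can rank two sets of estimates differently. The parameter values and estimates below are hypothetical and are not taken from the study.

```python
def mse(true_params, estimates):
    """Mean squared error over the parameters of one fitted model."""
    return sum((p - e) ** 2 for p, e in zip(true_params, estimates)) / len(true_params)


def mad(true_params, estimates):
    """Mean absolute deviation over the parameters of one fitted model."""
    return sum(abs(p - e) for p, e in zip(true_params, estimates)) / len(true_params)


true_params = [0.5, 0.5]
fit_a = [0.8, 0.5]   # one large error, one exact estimate
fit_b = [0.7, 0.7]   # two moderate errors

a_worse_by_mse = mse(true_params, fit_a) > mse(true_params, fit_b)
a_better_by_mad = mad(true_params, fit_a) < mad(true_params, fit_b)
```

Squaring penalizes the single large error in the first fit more heavily, so MSE prefers the second fit while MAD prefers the first.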

Our results seem to agree with those obtained
by Ansley and Newbold (1980). In particular, the
ML  method  seemed  to  work  best  for  small  samples 
and  for  more  complicated  models,  while  ULS  seemed 
to  perform  well  for  the  two  models  with  roots  of 
the  characteristic  polynomials  near  the  unit 
circle.  Ansley  and  Newbold  (1980)  observe  that 
ULS  estimates  tend  to  be  poor  in  the  sense  that 
they  often  give  estimates  yielding  characteristic 
polynomials  with  roots  near  the  unit  circle  even 
when  the  underlying  model  does  not  exhibit  roots 
near  the  unit  circle. 

6. CONCLUDING REMARKS

If  one  had  to  choose  a  single  package  for  time 
series  analysis,  SAS  would  probably  be  the  choice 
because  it  appears  to  provide  the  most  options 
and  flexibility.  On  the  other  hand,  with  access 
to  all  three  packages,  there  are  situations  where 
BMDP or SPSS-X might be used instead of or along
with  SAS.  I  have  seen  many  clever  things  done 
with  the  SAS  DATA  step  or  with  SAS  PROC  MATRIX, 
including  seasonal  adjustment,  state  space 
modeling,  and  kernel  spectral  estimation.  (In 
some  cases,  the  use  of  a  lower  level  language 
would  have  been  preferred,  but  the  motivation  was 
more  to  show  that  "SAS  could  do  it"  rather  than 
"this  is  the  way  it  should  be  done".) 

Historically,  BMDP  seems  to  have  been  the 
first  of  the  three  packages  to  provide  a  fully 
implemented  version  of  Box-Jenkins  transfer 
function  modeling.  SAS  followed  with  an  upgraded 
version  of  PROC  ARIMA  a  few  years  later  that 
matched the capabilities of BMDP2T. SPSS-X
currently has no transfer function capabilities.
SAS has yet to match BMDP's frequency domain
capabilities, and SPSS-X has no frequency domain
capabilities. On the other hand, the SAS/ETS
product  provides  a  comprehensive  collection  of 
tools  useful  to  time  series  analysts,  although, 
as  the  name  implies,  economic  applications 
dominate. 

Finally,  recalling  the  brief  discussion  above 
on  the  changing  technology,  the  future  looks  to 
more  sophisticated  interactive  time  series 
analysis  programs.  One  expects  to  see 
significant  changes  in  the  three  products 
mentioned  over  the  next  few  years  to  keep  pace 
with  the  statistical  theory  and  the  computer 
technology. 

The  information  contained  in  this  paper  was 
obtained  from  the  manuals  listed  in  the 
references  and  from  the  author's  experience  with 
the  packages.  None  of  the  software  vendors  were 
contacted  to  confirm  this  information.  Hence, 
care  should  be  exercised  in  using  this 
information  to  help  select  a  package  for 
performing  time  series  analysis. 
Display 1. Box-Jenkins Time Series Features of the Three Packages

feature is available in another procedure or requires some

Display 2. Results for Simulated Models, N=50

Display 3a. Overall Averages


Display 3c. Model Averages

Response(method)    n    MEAN          ST. DEV.

MODEL=2
MSE(ML)            10    0.00294754    0.00144234
MSE(ULS)           10    0.00275520    0.00081274
MAD(CLS)           10    0.06331050    0.03071304
MAD(ML)            10    0.05332070    0.01077235
MAD(ULS)           10    0.05210290    0.00670723

MODEL=3
MSE(CLS)           10    0.08881410    0.18256961
MSE(ML)            10    0.07576079    0.18695201
MSE(ULS)           10    0.07822707    0.18687692
MAD(CLS)           10    0.19463192    0.15043879
MAD(ML)            10    0.15485301    0.16217373
MAD(ULS)           10    0.15998148    0.16291455


Display 3d. Sample Size Averages

Response(method)    n    MEAN          ST. DEV.

N=50
MSE(CLS)           15    0.05037195    0.15388066
MSE(ML)            15    0.05000160    0.15452066
MSE(ULS)           15    0.05096727    0.15489169
MAD(CLS)           15    0.11407201    0.14307293
MAD(ML)            15    0.10799933    0.14645329
MAD(ULS)           15    0.11060924    0.14698484

N=100
MSE(CLS)           15    0.01800244    0.02510749
MSE(ML)            15    0.00912192    0.01581345
MSE(ULS)           15    0.00890961    0.01556248
MAD(CLS)           15    0.09980034    0.07964706
MAD(ML)            15    0.07654621    0.05517159
MAD(ULS)           15    0.07268561    0.05698740


Averages for each MODEL and N combination

Response(method)    n    MEAN          ST. DEV.

MODEL=1 N=50
MSE(ML)             5    0.00395779    0.00799318
MSE(ULS)            5    0.00321348    0.00571738
MAD(CLS)            5    0.04081240    0.04347671
MAD(ML)             5    0.04078180    0.05355648
MAD(ULS)            5    0.03903800    0.04595537

MODEL=1 N=100
MSE(CLS)            5    0.01460279    0.02593161
MSE(ML)             5    0.01599610    0.02668808
MSE(ULS)            5    0.01445261    0.02557444
MAD(CLS)            5    0.08491980    0.09612111
MAD(ML)             5    0.09650740    0.09139488
MAD(ULS)            5    0.08667780    0.09313678

MODEL=2 N=50
MSE(CLS)            5    0.00529622    0.00625272
MSE(ML)             5    0.00248983    0.00000631
MSE(ULS)            5    0.00249826    0.00000111
MAD(CLS)            5    0.06567540    0.03505294
MAD(ML)             5    0.04989820    0.00006322
MAD(ULS)            5    0.04998260    0.00001110

MODEL=2 N=100
MSE(CLS)            5    0.00441814    0.00422492
MSE(ML)             5    0.00340524    0.00203887
MSE(ULS)            5    0.00301214    0.00114943
MAD(CLS)            5    0.06094560    0.02965998
MAD(ML)             5    0.05674320    0.01522531
MAD(ULS)            5    0.05422320    0.00948585

MODEL=3 N=50
MSE(CLS)            5    0.14264180    0.25853880
MSE(ML)             5    0.14355716    0.25902080
MSE(ULS)            5    0.14719005    0.25802466
MAD(CLS)            5    0.23572824    0.20096962
MAD(ML)             5    0.23331799    0.20664777
MAD(ULS)            5    0.24280711    0.20164603

MODEL=3 N=100
MSE(CLS)            5    0.03498640    0.03017236
MSE(ML)             5    0.00796441    0.00756469
MSE(ULS)            5    0.00926409    0.01050028
MAD(CLS)            5    0.15353561    0.07943903
MAD(ML)             5    0.07638804    0.03287154
MAD(ULS)            5    0.07715584    0.04366065

Algorithms  for  Nonlinear  Generalized  Cross  Validation 

A variety of penalized nonlinear problems can be
expressed as the iterated solution to a nonlinear minimization, in
which the inner step involves minimizing a penalized weighted
least squares expression. We propose algorithms when matrices
in the least squares problem may depend on the unknown parameters.
The problems in increasing complexity are (a) generalized
linear models, (b) iterated reweighted least squares, and (c) general
nonlinear problems. The algorithms are built around
GCVPACK (Bates, Lindstrom, Wahba and Yandell, 1985), a
package for generalized cross-validation, using a balance of
Cholesky and singular value decompositions which is adjusted
depending on the type of problem.

1.  Introduction 

A variety of penalized nonlinear problems can be
expressed as the iteration to a solution of a nonlinear minimization,
in which the inner step involves minimizing a quadratic
form such as


mates, and absolute convergence of $\log(n\lambda)$. The number of
iterations in (2) may be restricted, leading to rough estimates
which are fed into (3).

We do not assume any special structure to the design or the
matrices, except that we suppose that W is of full rank, and
computationally invertible. In many cases, W is actually diagonal,
but this will not be explicitly used in the linear algebra.

Algorithms for the linear model (1.2) have been given by
many authors, most recently in the multivariate form by Bates et
al. (1985). The algorithms below are extensions of Bates et al.
(1985), building on their Fortran77 package, GCVPACK.

2.  Semi-Parametric  Generalized  Linear  Models 

For  semi-parametric  generalized  linear  models  (SGLM), 
one  has  a  parameter  vector  6  which  consists  of  a  parametric 
piece  and  a  "smooth”  nonpara  me  trie  piece, 

8,  =  S/o +/(*,)  ,  i  =1,  •  •  •  ,#i 


-  ||  WT(y-Sa-Tp- K5)  || 2  +  X8tK[/6  (1.1) 

n 

in  which  S,  T  and  K  are  the  design  matrices  for  the  covariates, 
polynomial  and  “smooth”  part  of  the  model,  and  y  and  W  are 
the  responses  and  the  weights.  The  simplest  form  is  the  partial 
spline  model,  or  semi-parametric  linear  model, 

y,  =S;Ta +/(*;) +  £i  ,i«1,  •••,n  ,  (1.2) 

in  which  /  (*)  is  some  “smooth”  function  and  £=(Ci,  •  •  •  ,  e«)T 
has  covariance  matrix  (WWT)-1  which  is  usually  diagonal.  We 
present  three  situations  and  proposed  computational  solutions 
when  matrices  in  the  above  linearized  problem  may  depend  on 
the  unknown  parameters.  The  problems  in  increasing  degree  of 
complexity  are: 

(1)  Semi-parametric  generalized  linear  models,  in  which  S,  T, 
K  and  Ky  are  constant,  while  W  and  y  may  change  with 
each  iteration. 

(2)  Iteratively  reweighted  least  squares,  in  which  only  K u 
remains  constant. 

(3)  General  nonlinear  problems  (remote  sensing,  for  example), 
in  which  all  matrices  may  change  with  each  iteration. 

Different  compromises  are  suggested  by  each  problem.  Clearly, 
one  would  like  to  decompose  the  constant  matrices  exactly  once 
and  would  like  to  keep  decompositions  of  die  changing  matrices 
as  cheap  as  possible.  The  method  proposed  here  combines  the 
advantages  of  SVD  in  locating  the  generalized  cross  validation 
choice  of  X  with  Cholesky  decompositions  which  are  relatively 
cheap  once  X  is  fixed.  While  the  decompositions  suggested  are 
not  new,  the  combination  of  approaches  appears  to  be  an  unex¬ 
plored  area.  The  basic  strategy  is  as  follows: 

( 1 )  guess  at  initial  X  ( = »)  and  (PT,  aT,  8T) 

(2)  CD:  iterate  (part-way)  to  solution  for  fixed  X 

(3)  linearize  the  problem  as  in  (1.1) 

(4)  SVD:  pick  optimal  X  via  GCV 

(5)  iterate  (2)-(4)  to  convergence 

Convergence  criteria  can  include  absolute  or  relative  conver¬ 
gence  of  the  regularization  functional  and/or  the  parameter  esti¬ 


One can formulate the problem as minimizing, for fixed $\lambda$,

$$ S_\lambda(\theta) = L(\theta) + \lambda J(\theta), $$

in which $L$ is the log likelihood and $J$ is the smoothing penalty
(see Good and Gaskins (1971); Leonard (1982); Green, Jennison and
Seheult (1983); O’Sullivan, Yandell and Raynor (1986); Green and
Yandell (1985)). We know from O’Sullivan (1983) that if $L(\theta)$ is
suitably convex and $J(\theta)$ is a quadratic form (e.g., the squared
norm of a projection), then $S_\lambda(\theta)$ has a unique minimum
for each $\lambda$. These conditions appear to hold for many
generalized linear models.

One can choose $\lambda$ to minimize the GCV criterion (Craven and
Wahba, 1979), which is “close” to minimizing the predictive mean square
error (see Craven and Wahba (1979); Speckman (1985); Cox (1983)). What
we propose to do here is to iterate on $\theta$ and $\lambda$, to find
the $\lambda$ which is the GCV minimizer and the $\theta$ which
minimizes $S_\lambda(\theta)$. It is not known whether such a procedure
will converge, but we conjecture that, if the GCV minimizer is bounded
away from 0 and $\infty$ and $L$ is suitably convex, then it does
converge.


The log likelihood can be written in an iterative form using
pseudo-values $y$ and pseudo-weights $W$,

$$ W W^T = E[\,\cdots\,], \qquad y = \theta^0 + (W W^T)^{-1}\,\cdots \tag{2.1} $$

based on $\theta^0$ from the previous iteration. Note that for the
independent normal model, $W^{-1}$ is a diagonal matrix of the standard
deviations and $y$ is the vector of observed responses. The linearized
log likelihood is

$$ L(\theta) = \frac{1}{n}\,\| W^T (y - \theta) \|^2 . $$
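
As an illustration of the pseudo-values and pseudo-weights, the sketch below works them out for a Poisson model with $\theta$ = log(mean). It assumes the usual Fisher-scoring form, in which $W W^T$ is the diagonal matrix of fitted means and the pseudo-value is the previous $\theta^0$ plus the residual scaled by the fitted mean; the function name `poisson_pseudo_values` is hypothetical and is not part of GCVPACK.

```python
import math

def poisson_pseudo_values(theta0, counts):
    """Pseudo-values and diagonal pseudo-weights for theta = log(mean).

    Assumption (not taken from the paper): Fisher scoring, so that
    W W^T = diag(mu) and pseudo-value = theta0 + (count - mu) / mu.
    """
    pseudo_y, w_diag = [], []
    for t, y in zip(theta0, counts):
        mu = math.exp(t)                    # fitted mean at the previous iterate
        pseudo_y.append(t + (y - mu) / mu)  # working response
        w_diag.append(math.sqrt(mu))        # diagonal of W, so W W^T = diag(mu)
    return pseudo_y, w_diag

pseudo_y, w_diag = poisson_pseudo_values([0.0, math.log(4.0)], [2, 4])
```

When a count equals its fitted mean, as in the second observation here, the pseudo-value stays at the previous $\theta^0$.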


The penalty $J$ can often be written in a nonnegative definite
quadratic form in $\theta$ (see Green and Yandell (1985)). We follow
the spline literature and formulate it as

$$ J(\theta) = J(\delta) = \delta^T K_U \delta \quad \text{subject to} \quad T_U^T \delta = 0 . $$

Typically the $k \times k$ matrix $K_U$ and $k \times t$ matrix $T_U$
are either derived from the unique design points or from a set of
user-supplied basis nodes (see Appendix 2 of Bates et al. (1985)). If
we write the parameter vector as

$$ \theta = S\alpha + T\beta + K\delta $$

in which $S$ is the $n \times c$ covariate matrix, $T$ is the
$n \times t$ polynomial matrix, and $K$ is the $n \times k$ smooth
matrix, the linearized problem becomes (1.1).

We can locate the unique design points, and the corresponding unique
covariates $S_U$, and form a QR decomposition

$$ [\,T_U : S_U\,] = FG = F_1 G_1 . $$

From this we construct the (unweighted) design

$$ X = [\,T : S : K F_2\,] \tag{2.2} $$

and penalty

$$ \Sigma = \begin{bmatrix} 0 & 0 \\ 0 & F_2^T K_U F_2 \end{bmatrix} . \tag{2.3} $$

We decompose $\Sigma$ using a pivoted Cholesky followed by a
Householder decomposition,

$$ E^T \Sigma E = L^T L \quad \text{and} \quad L^T = QR = Q_1 R_1 , \tag{2.4} $$

and construct

$$ Z = [\,Z_1 : Z_2\,] = X E\,[\,Q_1 R_1^{-T} : Q_2\,] . \tag{2.5} $$
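
The effect of the one-time decompositions is to change coordinates so that the penalty becomes an identity on the penalized directions and zero on the rest. The sketch below checks that property numerically. It is only a stand-in: NumPy has no pivoted Cholesky, so a symmetric eigendecomposition replaces the pivoted Cholesky and Householder steps, and the names `penalty_transform`, `t1`, `t2` and the toy matrices are hypothetical.

```python
import numpy as np

def penalty_transform(sigma, tol=1e-10):
    """Return (t1, t2) such that [t1 : t2].T @ sigma @ [t1 : t2] = diag(I, 0).

    Stand-in for (2.4)-(2.5): an eigendecomposition is used here instead of
    the pivoted Cholesky followed by Householder QR described in the text.
    """
    evals, evecs = np.linalg.eigh(sigma)          # eigenvalues in ascending order
    keep = evals > tol * evals.max()              # penalized directions
    t1 = evecs[:, keep] / np.sqrt(evals[keep])    # scaled so the penalty is identity
    t2 = evecs[:, ~keep]                          # null space: unpenalized directions
    return t1, t2

rng = np.random.default_rng(0)
b = rng.standard_normal((3, 3))
sigma = np.zeros((5, 5))
sigma[2:, 2:] = b.T @ b + np.eye(3)               # toy penalty of rank 3, as in (2.3)
x_mat = rng.standard_normal((8, 5))               # toy unweighted design, as in (2.2)

t1, t2 = penalty_transform(sigma)
z1, z2 = x_mat @ t1, x_mat @ t2                   # analogues of Z_1 and Z_2
```

With this change of coordinates the quadratic penalty on the original parameters reduces to the squared length of the coefficients attached to `z1`, which is the form used in the reparameterized objective.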

Finally, the original parameters are transformed to

$$ \cdots \tag{2.6} $$

In the usual case that $F_2$ is full rank, $E Q_2$ is an
$n \times (c+t)$ matrix which permutes the coefficients $\alpha$ and
$\beta$, i.e., $\omega^T = (\beta^T : \alpha^T : 0)\,E Q_2$. The
objective functional can now be reparameterized as

$$ \frac{1}{n}\,\| W^T (y - Z_2 \omega - Z_1 \gamma) \|^2 + \lambda\,\gamma^T \gamma . \tag{2.7} $$

At this point, we have done all the “one-time” decompositions. The
following steps must be redone each time $W$ and $y$ change, or simply
once for the linear (normal) model. We form a QR decomposition of

$$ W^T Z_2 = FG = F_1 G_1 , $$

and create

$$ J = \begin{bmatrix} J_1 \\ J_2 \end{bmatrix} = \begin{bmatrix} F_1^T \\ F_2^T \end{bmatrix} W^T Z_1 , $$

leading to the minimization of

$$ \frac{1}{n}\,\| F_1^T W^T y - G_1 \omega - J_1 \gamma \|^2 + \frac{1}{n}\,\| F_2^T W^T y - J_2 \gamma \|^2 + \lambda\,\gamma^T \gamma . \tag{2.8} $$

The first term can be made zero by solving for $\omega$, with any given
$\gamma$:

$$ G_1 \omega = F_1^T W^T y - J_1 \gamma . \tag{2.9} $$

The estimate of $\gamma$ is found by solving

$$ M \gamma = J_2^T F_2^T W^T y , \tag{2.10} $$

with

$$ M = J_2^T J_2 + n\lambda I . $$

The “hat” matrix can be formally written as

$$ A(\lambda) = W^{-T} F \begin{bmatrix} I & 0 \\ 0 & J_2 M^{-1} J_2^T \end{bmatrix} F^T W^T \tag{2.11} $$

provided we can invert $M$. Naturally, one would iterate to new
pseudo-values and pseudo-weights using (2.1) and repeat the
minimization of the objective function (2.7). At convergence, one can
obtain the estimates of the original parameter via (2.6).

One may approach the above solution for $\gamma$ and the “hat” matrix
$A(\lambda)$ in different ways, depending on whether one wishes to
choose a new $\lambda$, say via generalized cross validation, or
whether one wishes to leave $\lambda$ fixed.

2.1. SVD approach

One way to choose a new $\lambda$ is based on generalized cross
validation for the linearized problem (2.7). This is basically the
ridge regression problem of Golub, Heath and Wahba (1979). Form a
singular value decomposition of

$$ J_2 = U D V^T , $$

where $U$ and $V$ are orthogonal and $D$ is diagonal, to get

$$ \gamma = V (D^2 + n\lambda I)^{-1} D\,U^T F_2^T W^T y . $$

The “hat” matrix is

$$ A(\lambda) = W^{-T} F \begin{bmatrix} I & 0 \\ 0 & U D^2 (D^2 + n\lambda I)^{-1} U^T \end{bmatrix} F^T W^T . $$

One can choose $\lambda$ as the minimizer of the GCV criterion (Craven
and Wahba, 1979)

$$ V(\lambda) = \frac{n^{-1}\,\| W^T (I - A(\lambda))\,y \|^2}{\left[\,n^{-1}\,\mathrm{tr}(I - A(\lambda))\,\right]^2} , \tag{2.12} $$

or as some intermediate value if the minimizer is seen as being too
“far” from the previous value.
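
The SVD route can be sketched numerically for the reduced ridge problem. The example assumes, for brevity, that there is no unpenalized block (so the weighted response `z` plays the role of $F_2^T W^T y$ and the number of rows equals $n$), evaluates the GCV score on a grid of $\lambda$ from a single SVD, and uses NumPy rather than GCVPACK; `gcv_curve` and the simulated data are hypothetical.

```python
import numpy as np

def gcv_curve(j2, z, lambdas):
    """GCV score and ridge solution on a grid of lambda, from one SVD of j2."""
    n = z.shape[0]
    u, d, vt = np.linalg.svd(j2, full_matrices=False)   # j2 = u @ diag(d) @ vt
    uz = u.T @ z
    outside = z @ z - uz @ uz          # squared length of z outside range(j2)
    scores, gammas = [], []
    for lam in lambdas:
        shrink = d**2 / (d**2 + n * lam)                # eigenvalues of A(lambda)
        rss = outside + np.sum(((1.0 - shrink) * uz) ** 2)
        scores.append((rss / n) / ((n - shrink.sum()) / n) ** 2)
        gammas.append(vt.T @ (d / (d**2 + n * lam) * uz))
    return np.array(scores), np.array(gammas)

rng = np.random.default_rng(0)
j2 = rng.standard_normal((20, 4))
z = j2 @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(20)
lambdas = np.logspace(-4, 2, 13)
scores, gammas = gcv_curve(j2, z, lambdas)
best_lambda = lambdas[np.argmin(scores)]
```

Once the SVD is available, each additional value of $\lambda$ costs only a few vector operations, which is why the SVD is attractive for locating the GCV choice.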

2.2.  Cholesky  approach 

If we choose to leave $\lambda$ fixed, one can take the cheaper
approach of a Cholesky decomposition of

$$ M = J_2^T J_2 + n\lambda I = C^T C , $$

leading to the estimate of $\gamma$ by solving

$$ C^T C \gamma = J_2^T F_2^T W^T y . $$

The “hat” matrix becomes

$$ A(\lambda) = W^{-T} F \begin{bmatrix} I & 0 \\ 0 & J_2 C^{-1} C^{-T} J_2^T \end{bmatrix} F^T W^T . \tag{2.13} $$

This route was followed by O’Sullivan, Yandell and Raynor (1986),
iterating to a solution for fixed $\lambda$. The “optimal” $\lambda$
was chosen by minimizing $V(\lambda)$ over a grid of $\log(\lambda)$.
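
For fixed $\lambda$ the Cholesky route amounts to one factorization and two triangular solves. The sketch below spells this out in plain Python on a tiny example; the helper names are hypothetical, and `rhs_vec` stands in for $F_2^T W^T y$.

```python
import math

def cholesky_upper(m):
    """Return upper-triangular C with C^T C = m (m symmetric positive definite)."""
    p = len(m)
    c = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            s = m[i][j] - sum(c[k][i] * c[k][j] for k in range(i))
            if i == j:
                c[i][i] = math.sqrt(s)
            else:
                c[i][j] = s / c[i][i]
    return c

def solve_fixed_lambda(j2, rhs_vec, n, lam):
    """Solve (J2^T J2 + n*lam*I) gamma = J2^T rhs_vec by Cholesky."""
    p = len(j2[0])
    m = [[sum(row[a] * row[b] for row in j2) + (n * lam if a == b else 0.0)
          for b in range(p)] for a in range(p)]
    rhs = [sum(row[a] * r for row, r in zip(j2, rhs_vec)) for a in range(p)]
    c = cholesky_upper(m)
    u = [0.0] * p
    for i in range(p):                     # forward substitution: C^T u = rhs
        u[i] = (rhs[i] - sum(c[k][i] * u[k] for k in range(i))) / c[i][i]
    gamma = [0.0] * p
    for i in reversed(range(p)):           # back substitution: C gamma = u
        gamma[i] = (u[i] - sum(c[i][k] * gamma[k] for k in range(i + 1, p))) / c[i][i]
    return gamma

j2 = [[1, 0], [0, 1], [1, 1], [0, 0]]
gamma = solve_fixed_lambda(j2, [1, 2, 3, 0], n=4, lam=0.25)
```

In this example $M$ has rows (3, 1) and (1, 3) and the right-hand side is (4, 5), so the solution is (0.875, 1.375).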

3.  Iteratively  Reweighted  Least  Squares  Models 

Iteratively reweighted least squares (IRLS) models differ from
semi-parametric GLMs in that only the penalty matrix remains fixed
(Green, 1984). The log-likelihood parameter $\theta$ can be locally
linearized, but the $S$, $T$, and $K$ matrices are no longer fixed:

$$ S = \frac{\partial\theta}{\partial\alpha}, \quad T = \frac{\partial\theta}{\partial\beta}, \quad K = \frac{\partial\theta}{\partial\delta} . $$

We still only need form and decompose $\Sigma$ as in (2.3) and (2.4)
exactly once. However, the (unweighted) design (2.2) may change with
each iteration. Hence, the remaining computations need to be done at
each iteration. One could proceed in the same manner as for the
generalized linear models, but reconstructing $X$, and hence $Z$ and
$J$, each time.

4. General Nonlinear Models

General nonlinear problems could proceed in the same manner as for
IRLS, except that $K_U$ changes each time. Thus most computations need
to be redone. It may be possible for some nonlinear problems to
reparameterize them as SGLM or IRLS problems to eliminate this
difficulty.


5. Diagnostics

The diagonal elements of the “hat” matrix have been used for
diagnostics in generalized linear models (Pregibon, 1981) as well as in
smoothing spline models (Eubank 1984, 1985). It is natural to think of
extending these uses to the present array of models (Green and Yandell,
1985; Green, 1983). The diagonal elements can be computed as

$$ [A(\lambda)]_{ii} = \| F_1^T e_i \|^2 + \| M^{-1/2} J_2^T F_2^T e_i \|^2 $$

in which $e_i$ is the $n$-vector with a 1 in the $i$-th position and
0’s elsewhere. For the SVD approach this is simply

$$ [A(\lambda)]_{ii} = \| F_1^T e_i \|^2 + \| D (D^2 + n\lambda I)^{-1/2} U^T F_2^T e_i \|^2 , $$

and for the Cholesky approach (cf. O’Sullivan (1985)),

$$ [A(\lambda)]_{ii} = \| F_1^T e_i \|^2 + \| C^{-T} J_2^T F_2^T e_i \|^2 . $$

Covariance matrices can be computed by noting that
$\mathrm{COV}(y) = W^{-T} W^{-1}$. We find from (2.11) that

$$ \mathrm{COV}(\hat\theta) = W^{-T} F \begin{bmatrix} I & 0 \\ 0 & J_2 M^{-1} J_2^T J_2 M^{-T} J_2^T \end{bmatrix} F^T W^{-1} . $$

Hence, the variances are

$$ \mathrm{VAR}(\hat\theta_i) = \| F_1^T W^{-1} e_i \|^2 + \| J_2 M^{-T} J_2^T F_2^T W^{-1} e_i \|^2 . $$

Noting the relation

$$ M^{-1} J_2^T J_2 M^{-T} = M^{-1} (I - n\lambda M^{-1}) , $$

the variances can be written as

$$ \mathrm{VAR}(\hat\theta_i) = \| F_1^T W^{-1} e_i \|^2 + \| C^{-T} J_2^T F_2^T W^{-1} e_i \|^2 - n\lambda\,\| C^{-1} C^{-T} J_2^T F_2^T W^{-1} e_i \|^2 $$

for the Cholesky approach. For the SVD we have

$$ \mathrm{VAR}(\hat\theta_i) = \| F_1^T W^{-1} e_i \|^2 + \| D^2 (D^2 + n\lambda I)^{-1} U^T F_2^T W^{-1} e_i \|^2 . $$

The covariance among the coefficients can be derived, using (2.9),
(2.10) and (2.6), as

$$ \mathrm{COV}\begin{pmatrix} \hat\beta \\ \hat\alpha \\ \hat\delta \end{pmatrix} = E Q_2 G_1^{-1} G_1^{-T} Q_2^T E^T + E Q \begin{bmatrix} R_1^{-T} \\ -G_1^{-1} F_1^T W^T Z_1 \end{bmatrix} M^{-1} J_2^T J_2 M^{-T} \begin{bmatrix} R_1^{-T} \\ -G_1^{-1} F_1^T W^T Z_1 \end{bmatrix}^T Q^T E^T . $$

In many situations we may be only interested in
$\mathrm{COV}(\hat\alpha)$. Further, if the penalty $\Sigma$ is of the
proper rank, then the QR decomposition of (2.4) should simply permute
the indices for the coefficients. In other words, $E Q_2$ often simply
permutes the coefficients $\alpha$ (and $\beta$) into $\omega$. In this
case, let $\tilde e_i$ denote the permutation for $\alpha_i$,
$i = 1, \ldots, c$. For the SVD approach,

$$ \mathrm{VAR}(\hat\alpha_i) = \| G_1^{-T} \tilde e_i \|^2 + \| D (D^2 + n\lambda I)^{-1} V^T Z_1^T W F_1 G_1^{-T} \tilde e_i \|^2 . $$

For the Cholesky approach,

$$ \mathrm{VAR}(\hat\alpha_i) = \| G_1^{-T} \tilde e_i \|^2 + \| C^{-T} Z_1^T W F_1 G_1^{-T} \tilde e_i \|^2 - n\lambda\,\| C^{-1} C^{-T} Z_1^T W F_1 G_1^{-T} \tilde e_i \|^2 . $$

Joint work is in progress with Peter J. Green (Green and Yandell, 1985)
on analogues to diagnostic tools for generalized linear models along
the lines of Pregibon (1981, 1982) and Nelder and Pregibon (1986).

6.  Numerical  Comparisons 

We focus our investigations upon the Poisson and binomial special cases
of the semi-parametric generalized linear model as these are
potentially of wide interest and easy to formulate. We allowed up to c
initial iterations of the Cholesky decomposition (CD) for
$\lambda = \infty$ (perfectly smooth case), and up to c CDs following
each SVD, where c was 1, 2, or 10. No case required more than 7 CDs
following an SVD, or more than 7 SVDs overall.

We examined some real data on leafhopper oviposition and potato
pathogen in a field, both Poisson, and data on rat survival, which was
binomial. In addition we simulated data which we thought might be
“cumbersome” for the numerical algorithms. The simulations were Poisson
with a normal-shaped curve for $\theta$ = log(mean value), with peak
height between $\theta$ = 1.5 and 20. Binomial simulations used a
similar normal-shaped curve for $\theta$ = logit(mean value), with peak
height between $\theta$ = logit(.01) and logit(.3). Simulations were
conducted for n = 50 and 100.

The Cholesky steps in the real examples increased the run time by
20-35%, including one-time costs and construction of the diagonals of
the “hat” matrix (see Tables 1-3). This occurred because the number of
SVDs was not reduced by more intermediate CDs, nor were the sequences
of optimal $\lambda$s for the linearized problems markedly altered by
the CDs. In addition, each CD took about 10% of the time for an SVD. In
these examples, the signal was fairly apparent, indicating that the
linear approximation was adequate using the SVD iterations alone.
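
The alternation of Cholesky steps at fixed $\lambda$ with SVD steps that re-select $\lambda$ can be sketched as follows. This is a toy under assumptions that are not the paper's: a Poisson model with no unpenalized part, a penalty already reduced to the identity, $\lambda$ restricted to a fixed grid, simulated data, and NumPy in place of GCVPACK. All function names are hypothetical.

```python
import numpy as np

def linearize(z_mat, counts, gamma):
    """Weighted design W^T Z and weighted pseudo-values W^T y (Poisson, log link)."""
    theta = z_mat @ gamma
    mu = np.exp(theta)
    w = np.sqrt(mu)                               # diagonal of W
    pseudo = theta + (counts - mu) / mu           # pseudo-values
    return w[:, None] * z_mat, w * pseudo

def cholesky_step(j, wy, n_lam):
    """Update gamma for fixed lambda via M = J^T J + n*lam*I = C^T C."""
    m = j.T @ j + n_lam * np.eye(j.shape[1])
    low = np.linalg.cholesky(m)                   # m = low @ low.T
    return np.linalg.solve(low.T, np.linalg.solve(low, j.T @ wy))

def svd_step(j, wy, lambdas):
    """Pick lambda by GCV for the linearized problem; return (lambda, gamma)."""
    n = wy.shape[0]
    u, d, vt = np.linalg.svd(j, full_matrices=False)
    uz = u.T @ wy
    outside = wy @ wy - uz @ uz
    best = None
    for lam in lambdas:
        shrink = d**2 / (d**2 + n * lam)
        rss = outside + np.sum(((1.0 - shrink) * uz) ** 2)
        score = (rss / n) / ((n - shrink.sum()) / n) ** 2
        if best is None or score < best[0]:
            best = (score, lam, vt.T @ (d / (d**2 + n * lam) * uz))
    return best[1], best[2]

def fit(z_mat, counts, lambdas, c=2, max_svd=20, tol=1e-3):
    n, p = z_mat.shape
    lam, gamma = lambdas[-1], np.zeros(p)         # start at the smoothest grid value
    for n_svd in range(1, max_svd + 1):
        for _ in range(c):                        # up to c CDs with lambda fixed
            j, wy = linearize(z_mat, counts, gamma)
            gamma = cholesky_step(j, wy, n * lam)
        j, wy = linearize(z_mat, counts, gamma)   # linearize at the current estimate
        new_lam, gamma = svd_step(j, wy, lambdas) # SVD: GCV choice of lambda
        converged = abs(np.log(n * new_lam) - np.log(n * lam)) < tol
        lam = new_lam
        if converged:                             # absolute convergence of log(n*lambda)
            break
    return lam, gamma, n_svd

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
centers = np.linspace(0.0, 1.0, 8)
z_mat = np.exp(-((x[:, None] - centers[None, :]) ** 2) / 0.02)
counts = rng.poisson(np.exp(0.5 + np.exp(-((x - 0.5) ** 2) / 0.02)))
lambdas = np.logspace(-6, 1, 15)
lam, gamma, n_svd = fit(z_mat, counts, lambdas, c=2)
```

The argument `c` corresponds to the cap on CDs per SVD varied in the comparisons; setting `c=0` gives the SVD-only iteration.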


Table 1. Poisson Oviposition Data (n=27)

task          c=0      c=1      c=2     c=10
one-time     4.40     4.40     4.43     4.50
cholesky     0.78     4.22     7.78    11.92
svd         24.93    25.02    24.73    24.78
hat          2.20     2.22     2.23     2.22
total       31.07    34.57    37.85    42.10
no. svd         5        5        5        5
no. chol        1        6       11       19

Table 2. Binomial Rats Data (n=127)

task          c=0      c=2     c=10
one-time     34.6     33.9     35.0
cholesky      7.5     58.7     74.2
svd         245.0    245.6    243.9
hat          34.6     34.8     35.2
total       312.8    364.2    379.3
no. svd         5        5        5
no. chol        1        9       12



Table 3. 2-D Poisson Fungi (n=400, k=100)

task          c=0      c=2     c=10
one-time      279      279      283
cholesky      140     1004     1475
svd          4486     4413     4425
hat           594      598      598
total        5354     6150     6637
no. svd         7        7        7
no. chol        2       16       26

The simulations showed that when the “signal” is small relative to the
“noise”, the CDs seem to stabilize the minimization problem, reducing
the number of SVDs required and cutting the run time. Tables 4(a-b)
present the combined CD and SVD run times, while Tables 4(c-d) present
the numbers of SVDs and CDs. As the height of the Poisson peak rises,
the CD iterations have a reduced impact on convergence. However, note
that on several occasions iteration with only one CD increased the
number of SVDs required. Allowing more than 2 CD steps only seemed to
increase the overall run time; the number of SVDs was reduced in only a
few instances. In addition, a few simulations, not shown here,
converged when up to 2 CDs per SVD were allowed, but did not converge
when 0 or up to 10 were allowed. Similar statements can be made about
the binomial simulations (Table 5(a-b)).

Table 4(a). Poisson Run Times (n=50)

peak     c=0     c=1     c=2    c=10
1.5      134     120      94     103
2        163     150     130     141
2.5      134     148     126     134
3        132     148     125     138
4        159     178     155     142
5        158     180     157     144
6        131     173     155     120
7        133     159     127     161
8        131     175     157     141
9        135     178     158     144
10       157     204     188     174
15       134     180     187     181
20       158     207     189     175

Table 4(b). Poisson Run Times (n=100)

peak     c=0     c=1     c=2    c=10
1.5      974     848     885     904
2        950     834     880     933
2.5     1149    1051    1098     932
3        759     824     659     718
4        956    1048     882     967
5        955    1069    1100     988
6        970    1244     915    1006
7        938    1038     873     970
8        939    1053    1105    1043
9        955    1280    1138    1026
10      1129    1245    1106    1371
15       941    1252    1109     762
20       962    1276    1131    1143

Table 4(c). Poisson Runs (n=50)
no. SVD / no. CD iterations

peak     c=0     c=1     c=2    c=10
1.5      5/0     4/4     3/5    3/10
2        6/1     5/6     4/8    4/12
2.5      5/0     5/5     4/7    4/12
3        5/0     5/5     4/7    4/13
4        6/0     6/6     5/8    4/15
5        6/0     6/7     5/9    4/15
6        5/0     6/6     5/9    3/16
7        5/0     5/5     4/7    4/19
8        5/0     6/6     5/9    4/14
9        5/1     6/6     5/9    4/15
10       6/0     7/7    6/10    5/16
15       5/1     6/7    6/10    5/18
20       6/0     7/7    6/10    5/16

Table 4(d). Poisson Runs (n=100)
no. SVD / no. CD iterations

peak     c=0     c=1     c=2    c=10
1.5      5/0     4/4     4/6    4/10
2        5/0     4/4     4/7    4/12
2.5      6/0     5/5     5/8    4/11
3        4/0     4/4     3/6    3/11
4        5/0     5/5     4/7    4/13
5        5/0     5/6     5/8    4/14
6        5/1     6/6     4/9    4/15
7        5/0     5/5     4/7    4/14
8        5/0     5/6     5/9    4/17
9        5/0     6/7     5/9    4/16
10       6/0     6/6     5/9    5/23
15       5/0     6/6     5/9    3/13
20       5/0     6/6     5/9    4/19

Table 5(a). Binomial Run Times (n=50)

size   prob     c=0     c=1     c=2    c=10
10     .3       108      87      90      91
       .2       106     118     125     131
       .1       133     118      92      97
       .05      135     148     130     135
20     .3       109      91      92      96
       .2       137     119     123     127
       .1       109     120     124     129
       .05      165     151     159     168

Table 5(b). Binomial Run Times (n=100)

size   prob     c=0     c=1     c=2    c=10
10     .3       943     827     671     692
       .2       970     829     858     882
       .1       968     860     885     937
       .05     1171    1064     898     977
       .01     1166    1046    1097     935
20     .3       743     604     632     635
       .2       760     617     636     645
       .1       780     838     650     680
       .05      795     849     681     742
       .01     1351    1261    1103    1225
       .005    1513    1676    1536    1683



Table 5(c). Binomial Runs (n=50)
no. SVD / no. CD iterations

size   prob     c=0     c=1     c=2    c=10
10     .3       5/0     4/4     3/5     3/8
       .2       4/0     4/4     4/6     4/9
       .1       4/0     3/3     3/5     3/6
       .05      5/0     5/5     4/8    4/11
20     .3       4/0     4/4     4/6     4/9
       .2       5/1     4/4     4/6     4/8
       .1       4/1     3/4     3/5     3/7
       .05      6/0     5/5     5/8    5/12

Table 5(d). Binomial Runs (n=100)
no. SVD / no. CD iterations

size   prob     c=0     c=1     c=2    c=10
10     .3       5/0     4/4     3/6     3/8
       .2       5/0     4/4     4/6     4/8
       .1       5/0     4/5     4/7    4/11
       .05      6/0     5/5     4/7    4/12
       .01      6/0     5/5     5/8    4/11
20     .3       4/0     3/3     3/5     3/6
       .2       4/1     3/3     3/5     3/7
       .1       4/1     4/4     3/5     3/8
       .05      4/0     4/4     3/6    3/10
       .01      7/0     6/6     5/8    5/14
       .005     8/0     8/8    7/11    7/20

Since we know that the estimates converge for fixed $\lambda$
(O’Sullivan, Yandell and Raynor, Jr., 1986), a few iterations for fixed
$\lambda$ may guard against nonlinearity in the penalized likelihood.
It is not known at this time what conditions are required on the
penalized likelihood, as a function of $\lambda$, to ensure convergence
in the SVD-only approach.

If one follows Elden (1984) and stops the singular value decomposition
after the bidiagonalization, considerable time can be saved, since the
effort to diagonalize is magnified by the number of iterations. Earlier
work on GCVPACK (Bates et al., 1985) indicated that half of the
singular value decomposition time may be spent on bidiagonalization. Of
course, once convergence is reached, one could complete the
diagonalization, doing this only once, to easily derive the diagonal of
the “hat” matrix. Such savings in computation would further reduce the
advantage of iterating via Cholesky with fixed $\lambda$.

APPENDIX  A.  SYMPOSIUM  ATTENDANTS

CSU

ALFRED  BALCH
COLORADO  STATE  UNIV.
STATISTICS  DEPT.

PACIFIC  GAS  AND  ELECTRIC  CO.

77  BEALE  ST.  RM  1113 

SAN  FRANCISCO,  CA  94106 


MAX  BENSON 
UNIV.  OF  MN 
10  UNIV.  DR. 
DULUTH,  MN  55812 


KENNETH  BERRY 
DEPT.  OF  SOCIOLOGY 
CSU 

FT.  COLLINS,  CO  80523


RON  BIONDINI 
TRW 

14241  E.  4TH  AVE. 
AURORA,  CO  80011


DANKMAR  BOHNING 
215  POND  LABORATORY 
UNIVERSITY  PARK,  PA  16802 


CELEDONIO  A.  BRAVO 
COLORADO  STATE  UNIV. 
DEPT.  OF  FOREST  &  WOOD 
FT.  COLLINS,  CO  80523

COLORADO  STATE  UNIV-STAT  DEPT

RM.  100  OLD  ECON.  BLDG. 

FORT  COLLINS,  CO  80523 


PETER  BRYANT 
UNIV.  OF  COLORADO 
1475  LAWRENCE  ST. 
DENVER,  CO  80202 


DAVID  BUNCH 

UNIVERSITY  OF  CALIFORNIA 
308  VOORHIES 
DAVIS,  CA  95616 


DAVID  ALLEN 
UNIV.  OF  KY 
DEPT.  OF  STATISTICS 
LEXINGTON,  KY  40506 


SCOTT  ATKINSON 
UNIV.  OF  WYOMING 
P.O.  BOX  392 S 
LARAMIE,  WY  82071 


DAVID  BALSIGER 
JOINER  ASSOCS. 
P.O.  BOX  5445 
MADISON,  WI  53705 


JIM  BAYLIS 
EASTMAN  KODAK 
BLDG.  C-42 
WINDSOR,  CO  80551 


JON  BENTLEY 
AT&T  BELL  LABS 
RM.  2C-317 

MURRAY  HILL,  NJ  07974 


LYNNE  BILLARD 
UNIVERSITY  OF  GEORGIA 
DEPT.  OF  STATISTICS
ATHENS,  GA  30602


THOMAS  BOARDMAN 
COLORADO  STATE  UNIVERSITY 
207  OLD  ECON  BLDG. 

FORT  COLLINS,  CO  80523 


M.T.  BOSWELL 
PSU 

301  POND  LAB 
UNIV.  PARK,  PA  16802 


LEO  BREIMAN 
STATISTICS 

UNIV.  OF  CA  -  BERKELEY 
BERKELEY,  CA  94707 

BOULDER,  CO  80307